diff --git a/.copilot-tracking/IMPLEMENTATION_ARCHIVE.md b/.copilot-tracking/IMPLEMENTATION_ARCHIVE.md
deleted file mode 100644
index 4b68ab6..0000000
--- a/.copilot-tracking/IMPLEMENTATION_ARCHIVE.md
+++ /dev/null
@@ -1,58 +0,0 @@
-# Implementation Archive
-
-This file contains completed implementation history for reference. See IMPLEMENTATION_PLAN.md for current work.
-
-**Archive Date**: 2026-01-23
-
-## All Priority 0-2 Features: ✅ COMPLETE
-
-All critical security, data integrity, and user experience features have been successfully implemented and tested.
-
-### Completed Implementation (2026-01-23)
-
-**Security (Priority 0):**
-- DoS Prevention with rate limiting
-- PII Detection with warnings
-- XSS Sanitization with shared validation utilities
-
-**Data Integrity (Priority 1):**
-- Batch Validation with structured error codes
-- Duplicate Detection with normalized text comparison
-- Assignment Error Feedback with conflict details
-
-**User Experience (Priority 2):**
-- Explorer State Preservation (URL-based filters)
-- Keyword Search (full-text search)
-- Tag Filtering (tri-state: include/exclude/neutral)
-- Assignment Takeover (admin force-assignment)
-- Explorer Sorting (including tag count)
-- Modal Keyboard Handling
-- Inspection Performance (session cache)
-
-**Technical Debt (Priority 3):**
-- Frontend Code Quality (removed skipped tests, fixed tag glossary isolation)
-- Backend Code Cleanup (removed print statements, rate limiter test isolation)
-- CI Code Quality Gates (type checker clean, 0 errors)
-- Pre-commit hooks for frontend
-
-**Documentation (Priority 4):**
-- Documentation Infrastructure (MkDocs with Material theme)
-- Documentation Content (guides, API docs, architecture docs)
-- Tag Glossary (tooltips, full view, inline editing for custom tags)
-
-**Performance & Optimization:**
-- Cosmos Indexing Policy optimization (ready for deployment)
-- Partial Updates optimization (patch operations)
-- Query Performance Monitoring infrastructure
-
-### Test Status (Archive Date: 2026-01-23)
-
-- **Backend**: 267 unit tests passing, 138 integration tests passing
-- **Frontend**: 237 tests passing
-- **Type Checking**: All checks passed (backend ty, frontend tsc)
-
-### Architecture Notes
-
-- **Architecture**: Well-structured with 8 specialized services (Assignment, Curation, Search, TagRegistry, Chat, Snapshot, Validation, Inference)
-- **Dependency Injection**: Pragmatic hybrid approach (FastAPI Depends, Container singleton, Pydantic Settings)
-- **Code Quality**: No print statements, no skipped tests, type-safe with zero type checker errors
diff --git a/.copilot-tracking/changes/20260116-export-pipeline-design-changes.md b/.copilot-tracking/changes/20260116-export-pipeline-design-changes.md
deleted file mode 100644
index 636a3c6..0000000
--- a/.copilot-tracking/changes/20260116-export-pipeline-design-changes.md
+++ /dev/null
@@ -1,45 +0,0 @@
-<!-- markdownlint-disable-file -->
-# Release Changes: Export pipeline design
-
-**Related Plan**: 20260116-export-pipeline-design-plan.instructions.md
-**Implementation Date**: 2026-01-16
-
-## Summary
-
-Planned updates for the export pipeline design implementation.
-
-## Changes
-
-### Added
-
-### Modified
-
-* docs/computed-tags-design.md - Documented the snapshot export baseline contract for pipeline compatibility.
-* docs/computed-tags-design.md - Defined the v1 export pipeline API surface and defaults.
-* docs/computed-tags-design.md - Updated the pipeline entry point to reuse the snapshot POST route.
-* docs/computed-tags-design.md - Added processor and formatter interface rules with determinism guidance.
-* docs/computed-tags-design.md - Documented registries, config env vars, and container wiring.
-* docs/computed-tags-design.md - Added execution flow, delivery modes, and initial formatter output shapes.
-* docs/computed-tags-design.md - Documented export storage interface and Blob configuration strategy.
-* docs/computed-tags-design.md - Selected backend streaming delivery for Blob-hosted artifacts.
-* docs/computed-tags-design.md - Updated Blob authentication to managed identity only with local export warning.
-* docs/computed-tags-design.md - Added test strategy and rollout guidance for the export pipeline.
-
-### Removed
-
-## Release Summary
-
-**Total Files Affected**: 3
-
-### Files Modified (3)
-
-* docs/computed-tags-design.md - Added export pipeline design details across baseline, interfaces, execution, storage, and testing.
-* .copilot-tracking/changes/20260116-export-pipeline-design-changes.md - Recorded implementation progress and summaries.
-* .copilot-tracking/plans/20260116-export-pipeline-design-plan.instructions.md - Marked all phases and tasks complete.
-
-### Dependencies & Infrastructure
-
-* **New Dependencies**: None
-* **Updated Dependencies**: None
-* **Infrastructure Changes**: None
-* **Configuration Updates**: None
diff --git a/.copilot-tracking/changes/20260116-export-pipeline-implementation-changes.md b/.copilot-tracking/changes/20260116-export-pipeline-implementation-changes.md
deleted file mode 100644
index eeb01f3..0000000
--- a/.copilot-tracking/changes/20260116-export-pipeline-implementation-changes.md
+++ /dev/null
@@ -1,110 +0,0 @@
----
-title: Export pipeline implementation changes
-description: Tracking updates for the export pipeline implementation work.
-ms.date: 2026-01-16
----
-
-<!-- markdownlint-disable-file -->
-# Release Changes: Export pipeline implementation
-
-**Related Plan**: 20260116-export-pipeline-implementation-plan.instructions.md
-**Implementation Date**: 2026-01-16
-
-## Summary
-
-Tracking updates for the export pipeline implementation tasks.
-
-## Changes
-
-### Added
-
-* backend/app/exports/__init__.py - Introduced the export pipeline package marker.
-* backend/app/exports/models.py - Added request models for snapshot export defaults.
-* backend/app/exports/registry.py - Added processor and formatter registries with name resolution helpers.
-* backend/app/exports/processors/__init__.py - Added export processor package marker.
-* backend/app/exports/processors/merge_tags.py - Added merge tags export processor.
-* backend/app/exports/formatters/__init__.py - Added export formatter package marker.
-* backend/app/exports/formatters/json_items.py - Added JSON items export formatter.
-* backend/app/exports/formatters/json_snapshot_payload.py - Added JSON snapshot payload formatter.
-* backend/tests/unit/test_export_registry.py - Added unit tests for export registry behavior.
-* backend/tests/unit/test_export_formatters.py - Added unit tests for export formatter outputs.
-* backend/tests/unit/test_export_processors.py - Added unit tests for export processor behavior.
-* backend/app/exports/storage/__init__.py - Added export storage package marker.
-* backend/app/exports/storage/base.py - Added export storage interface protocol.
-* backend/app/exports/storage/local.py - Added local filesystem export storage backend.
-* backend/app/exports/storage/blob.py - Added Azure Blob export storage backend.
-* backend/app/exports/pipeline.py - Added pipeline delivery helpers for attachments, streams, and artifacts.
-* backend/tests/unit/test_export_pipeline.py - Added unit tests for pipeline delivery behaviors.
-
-### Modified
-
-* .copilot-tracking/plans/20260116-export-pipeline-implementation-plan.instructions.md - Marked Task 1.1 complete after verifying snapshot endpoint contracts.
-* backend/app/api/v1/ground_truths.py - Allowed optional snapshot request bodies while preserving legacy behavior.
-* .copilot-tracking/plans/20260116-export-pipeline-implementation-plan.instructions.md - Marked Task 1.2 and Phase 1 as complete.
-* .copilot-tracking/plans/20260116-export-pipeline-implementation-plan.instructions.md - Marked Task 2.1 complete after adding request models.
-* backend/app/core/config.py - Added export processor order setting for pipeline configuration.
-* .copilot-tracking/plans/20260116-export-pipeline-implementation-plan.instructions.md - Marked Task 2.2 and Phase 2 as complete.
-* .copilot-tracking/plans/20260116-export-pipeline-implementation-plan.instructions.md - Marked Tasks 3.1-3.2 and Phase 3 as complete.
-* backend/app/core/config.py - Added export storage settings and blob configuration validation.
-* backend/pyproject.toml - Added Azure Blob SDK dependency.
-* .copilot-tracking/plans/20260116-export-pipeline-implementation-plan.instructions.md - Marked Tasks 4.1-4.3 and Phase 4 as complete.
-* backend/app/container.py - Wired export registries, storage, and pipeline into the container.
-* backend/app/exports/registry.py - Added formatter factory support for contextual formatting.
-* backend/app/exports/formatters/json_snapshot_payload.py - Preserved legacy filters by avoiding injected dataset names.
-* backend/app/services/snapshot_service.py - Delegated snapshot payloads and artifacts to the export pipeline.
-* backend/app/api/v1/ground_truths.py - Routed snapshot POST requests through the pipeline with validation.
-* backend/tests/unit/test_snapshot_service.py - Updated snapshot service tests for pipeline wiring.
-* backend/tests/unit/test_export_registry.py - Updated registry tests for formatter creation.
-* .copilot-tracking/plans/20260116-export-pipeline-implementation-plan.instructions.md - Marked Tasks 5.1-5.3 and Phase 5 as complete.
-
-### Removed
-
-* .copilot-tracking/prompts/implement-export-pipeline-implementation.prompt.md - Removed implementation prompt after completing tasks.
-
-## Release Summary
-
-Total files affected: 26.
-
-### Files Created (17)
-
-* backend/app/exports/__init__.py - Export pipeline package marker
-* backend/app/exports/models.py - Snapshot export request models
-* backend/app/exports/registry.py - Export processor and formatter registries
-* backend/app/exports/processors/__init__.py - Export processors package marker
-* backend/app/exports/processors/merge_tags.py - Merge tags export processor
-* backend/app/exports/formatters/__init__.py - Export formatters package marker
-* backend/app/exports/formatters/json_items.py - JSON items formatter
-* backend/app/exports/formatters/json_snapshot_payload.py - Snapshot payload formatter
-* backend/app/exports/storage/__init__.py - Export storage package marker
-* backend/app/exports/storage/base.py - Export storage protocol
-* backend/app/exports/storage/local.py - Local export storage backend
-* backend/app/exports/storage/blob.py - Azure Blob export storage backend
-* backend/app/exports/pipeline.py - Export pipeline delivery helpers
-* backend/tests/unit/test_export_registry.py - Export registry unit tests
-* backend/tests/unit/test_export_formatters.py - Export formatter unit tests
-* backend/tests/unit/test_export_processors.py - Export processor unit tests
-* backend/tests/unit/test_export_pipeline.py - Export pipeline delivery unit tests
-
-### Files Modified (8)
-
-* .copilot-tracking/plans/20260116-export-pipeline-implementation-plan.instructions.md - Task progress updates
-* .copilot-tracking/changes/20260116-export-pipeline-implementation-changes.md - Change tracking updates
-* backend/app/api/v1/ground_truths.py - Snapshot pipeline routing and validation
-* backend/app/core/config.py - Export settings and blob validation
-* backend/app/container.py - Export pipeline wiring
-* backend/app/services/snapshot_service.py - Pipeline-backed snapshot logic
-* backend/pyproject.toml - Azure Blob SDK dependency
-* backend/tests/unit/test_snapshot_service.py - Pipeline-aware snapshot tests
-
-### Files Removed (1)
-
-* .copilot-tracking/prompts/implement-export-pipeline-implementation.prompt.md - Cleanup prompt file
-
-### Dependencies & Infrastructure
-
-* New dependency azure-storage-blob
-* Export storage settings and validation in backend/app/core/config.py
-
-### Deployment Notes
-
-Ensure the blob backend settings are configured before switching `GTC_EXPORT_STORAGE_BACKEND` to `blob`.
diff --git a/.copilot-tracking/changes/20260123-assignment-error-feedback-changes.md b/.copilot-tracking/changes/20260123-assignment-error-feedback-changes.md
deleted file mode 100644
index fddbc03..0000000
--- a/.copilot-tracking/changes/20260123-assignment-error-feedback-changes.md
+++ /dev/null
@@ -1,71 +0,0 @@
-<!-- markdownlint-disable-file -->
-# Release Changes: Assignment Error Feedback
-
-**Related Plan**: IMPLEMENTATION_PLAN.md (Priority 1 - Data Integrity)
-**Implementation Date**: 2026-01-23
-
-## Summary
-
-Enhanced assignment conflict responses with structured payload that includes current assignee information (`assignedTo`, `assignedAt`). When attempting to assign an item already assigned to another user, the 409 response now provides structured JSON with assignment details instead of just a plain error message, enabling better UI feedback and conflict resolution workflows.
-
-## Changes
-
-### Added
-
-* `backend/app/core/errors.py` - Added `AssignmentConflictError` exception class with `assigned_to` and `assigned_at` attributes
-* `backend/tests/integration/test_assignments_assign_single_cosmos.py` - Added test assertions to verify structured 409 payload (lines 87-105)
-
-### Modified
-
-* `backend/app/services/assignment_service.py` - Import `AssignmentConflictError` from core.errors module
-* `backend/app/services/assignment_service.py` - Changed line 207 to raise `AssignmentConflictError`, carrying the assignment details (`assigned_to`, `assigned_at`), instead of a plain `ValueError`
-* `backend/app/api/v1/assignments.py` - Import `AssignmentConflictError` and `JSONResponse`
-* `backend/app/api/v1/assignments.py` - Updated `assign_item` endpoint return type to `GroundTruthItem | JSONResponse` with `response_model=None`
-* `backend/app/api/v1/assignments.py` - Added `except AssignmentConflictError` handler (lines 280-293) that returns structured JSON response with `detail`, `assignedTo`, and `assignedAt` fields
-* `backend/tests/integration/test_assignments_assign_single_cosmos.py` - Updated test docstring and added assertions for structured response verification
-
-### Removed
-
-* None
-
-## Release Summary
-
-**Total files affected**: 4 files modified
-
-**API Changes**:
-- `POST /v1/assignments/{dataset}/{bucket}/{item_id}/assign` now returns structured JSON on 409 conflict:
-  ```json
-  {
-    "detail": "Item is already assigned to another user",
-    "assignedTo": "user@example.com",
-    "assignedAt": "2026-01-23T12:34:56.789012+00:00"
-  }
-  ```
-
-**Error Response Structure**:
-- `detail` (str): Human-readable error message
-- `assignedTo` (str): Email/ID of user who currently has the assignment
-- `assignedAt` (str, optional): ISO 8601 timestamp of when the assignment was made (if available)
-
-**Exception Hierarchy**:
-- New `AssignmentConflictError` exception carries structured data through service → API layer
-- Replaces generic `ValueError("Item is already assigned to another user")`
-- Enables consistent structured responses across all assignment conflict scenarios
-
-**Testing**: 
-- All 225 unit tests passing
-- Integration test updated to verify structured response fields
-- Type checking passes with `ty check`
-
-**Backward Compatibility**:
-- HTTP status code remains 409 (no change)
-- Response structure changed from simple `detail` string to structured JSON object
-- Clients that read only the `detail` field will still work but won't use the new assignment info
-- Frontend should update to display `assignedTo` information in conflict dialogs
-
-**Deployment Notes**: 
-- No database migrations required
-- No configuration changes required
-- Backend-only changes
-- Recommended: Update frontend to display assignee info when assignment conflicts occur
-- Enables future "Assignment Takeover" feature (force parameter) with better UX
diff --git a/.copilot-tracking/changes/20260123-assignment-takeover-changes.md b/.copilot-tracking/changes/20260123-assignment-takeover-changes.md
deleted file mode 100644
index 32ed22f..0000000
--- a/.copilot-tracking/changes/20260123-assignment-takeover-changes.md
+++ /dev/null
@@ -1,85 +0,0 @@
-# Release Changes: Assignment Takeover
-
-**Implementation Date**: 2026-01-23
-
-## Summary
-
-Implemented assignment takeover functionality allowing users with `admin` or `team-lead` roles to forcefully reassign ground truth items that are currently assigned to another user in draft status. This addresses the operational need to redistribute work when team members are unavailable.
-
-## Changes
-
-### Added
-
-* `backend/tests/integration/test_assignments_assign_single_cosmos.py` - Added 5 new integration tests for force assignment scenarios:
-  - `test_force_assign_without_role_returns_403` - Validates permission denial for regular users
-  - `test_force_assign_with_admin_role_succeeds` - Validates successful force assignment with admin role
-  - `test_force_assign_with_team_lead_role_succeeds` - Validates successful force assignment with team-lead role
-  - `test_force_assign_unassigned_item_succeeds` - Validates force assignment on unassigned items (no-op)
-* `backend/tests/integration/conftest.py` - Added `admin_headers` and `team_lead_headers` fixtures with proper role claims in X-MS-CLIENT-PRINCIPAL header
-* `backend/app/api/v1/assignments.py` - Added `AssignmentItemRequest` Pydantic model with `force: bool` field for request body
-
-### Modified
-
-* `backend/app/api/v1/assignments.py` - Updated `/v1/assignments/{dataset}/{bucket}/{item_id}/assign` endpoint:
-  - Accepts optional request body with `force` parameter
-  - Passes `user.roles` to service layer for permission checking
-  - Handles `PermissionError` and returns HTTP 403 Forbidden
-  - Added comprehensive docstring explaining force assignment behavior
-* `backend/app/services/assignment_service.py` - Enhanced `assign_single_item()` method:
-  - Added `force: bool = False` and `user_roles: list[str] | None = None` parameters
-  - Added `_has_takeover_permission(roles: list[str]) -> bool` helper method
-  - Implemented force assignment logic that clears previous assignment before reassigning
-  - Added cleanup of previous assignment document after successful force takeover
-  - Enhanced logging to record force-assign events with previous assignee information
-  - Added proper error handling for assignment document cleanup failures
-* `backend/app/api/v1/ground_truths.py` - Fixed bug in duplicate detection:
-  - Changed `page_size` parameter to `limit` (correct parameter name)
-  - Changed `sort_field` to `sort_by` (correct parameter name)
-  - Fixed tuple unpacking for `list_gt_paginated` return value
-  - Added try-except wrapper to gracefully handle NotImplementedError in unit tests
-
-## Technical Details
-
-### Authorization Model
-
-- Uses existing `UserContext.roles` from Azure AD claims
-- Checks for `admin` or `team-lead` role in the roles list
-- Returns HTTP 403 if force assignment attempted without proper role
-
-### Force Assignment Flow
-
-1. Service layer validates user has required role
-2. Stores previous assignee for cleanup
-3. Clears `assignedTo` and `assigned_at` fields from the item via `upsert_gt`
-4. Calls standard `assign_to` method to assign to new user
-5. Cleans up previous user's assignment document
-6. Logs force takeover event with previous and new assignee details
-
-### Error Handling
-
-- `PermissionError` raised if force=True without admin/team-lead role
-- `AssignmentConflictError` raised if force=False and item already assigned
-- Assignment document cleanup errors are logged but don't fail the request
-
-## Test Results
-
-- All 10 assignment integration tests pass
-- All 253 backend unit tests pass
-- New tests validate:
-  - Permission denial for non-privileged users (403)
-  - Successful force assignment with admin role
-  - Successful force assignment with team-lead role
-  - Force assignment on unassigned items (no-op)
-  - Assignment document cleanup
-
-## Deployment Notes
-
-- Backend changes only; frontend confirmation dialog deferred to separate implementation
-- No database migrations required
-- Compatible with existing assignment workflow
-- Azure AD app registration must define `admin` and `team-lead` roles in manifest
-
-## Related Files
-
-- Specification: `specs/assignment-takeover.md`
-- Implementation Plan: `IMPLEMENTATION_PLAN.md` (Priority 2 - User Experience)
diff --git a/.copilot-tracking/changes/20260123-backend-code-cleanup-changes.md b/.copilot-tracking/changes/20260123-backend-code-cleanup-changes.md
deleted file mode 100644
index d53be48..0000000
--- a/.copilot-tracking/changes/20260123-backend-code-cleanup-changes.md
+++ /dev/null
@@ -1,54 +0,0 @@
-<!-- markdownlint-disable-file -->
-# Release Changes: Backend Code Cleanup
-
-**Related Plan**: IMPLEMENTATION_PLAN.md (Priority 3 - Technical Debt & Code Quality)
-**Implementation Date**: 2026-01-23
-
-## Summary
-
-Replaced the last remaining print statement in the backend codebase with proper structured logging. Converted `_to_doc` from a static method to an instance method to enable access to the class logger, improving error tracking and debugging capabilities. This completes the backend code cleanup initiative to eliminate debugging artifacts.
-
-## Changes
-
-### Added
-
-* None
-
-### Modified
-
-* `backend/app/adapters/repos/cosmos_repo.py` - Converted `_to_doc` method from `@staticmethod` to instance method (removed decorator, added `self` parameter)
-* `backend/app/adapters/repos/cosmos_repo.py` - Replaced `print(item.__repr__())` on line 401 with `self._logger.error(f"Document missing datasetName: {item!r}")`
-
-### Removed
-
-* None
-
-## Release Summary
-
-**Total files affected**: 1 file modified
-
-**Code Quality Improvements**:
-- Eliminated last remaining `print()` statement in `backend/app/` directory
-- Improved error tracking with structured logging using class logger
-- Better debugging capabilities with `logger.error()` instead of console output
-
-**Technical Details**:
-- `_to_doc` method signature changed from `_to_doc(item: GroundTruthItem)` to `_to_doc(self, item: GroundTruthItem)`
-- Method still called as `self._to_doc(item)` from lines 479 and 1138, so no call-site changes required
-- Error message now properly logged at ERROR level with formatted item representation
-
-**Testing**: 
-- All 226 unit tests passing
-- No test changes required (method signature compatible with existing usage)
-- Verified no remaining print statements with `grep -rn "print(" backend/app/`
-
-**Backward Compatibility**:
-- Internal refactoring only, no API changes
-- No behavior changes from external perspective
-- Better logging output for debugging production issues
-
-**Deployment Notes**: 
-- No database migrations required
-- No configuration changes required
-- No frontend changes required
-- Improved observability for datasetName validation errors
diff --git a/.copilot-tracking/changes/20260123-batch-validation-changes.md b/.copilot-tracking/changes/20260123-batch-validation-changes.md
deleted file mode 100644
index 6d33809..0000000
--- a/.copilot-tracking/changes/20260123-batch-validation-changes.md
+++ /dev/null
@@ -1,66 +0,0 @@
-<!-- markdownlint-disable-file -->
-# Release Changes: Batch Validation Improvements
-
-**Related Plan**: IMPLEMENTATION_PLAN.md (Priority 1 - Data Integrity)
-**Implementation Date**: 2026-01-23
-
-## Summary
-
-Enhanced bulk import validation with structured error objects that provide programmatic error handling, row-level context, and field-specific feedback. The new error format includes error codes (INVALID_TAG, DUPLICATE_ID, CREATE_FAILED), 0-based index tracking, field names, and validation summaries with total/succeeded/failed counts.
-
-## Changes
-
-### Added
-
-* `backend/app/domain/models.py` - Added `BulkImportError` model with structured fields (index, item_id, field, code, message)
-* `backend/app/domain/models.py` - Added `ValidationSummary` model with total/succeeded/failed statistics
-
-### Modified
-
-* `backend/app/api/v1/ground_truths.py` - Updated `ImportBulkResponse` to use structured `BulkImportError` objects instead of plain strings; added `failed` count and `validation_summary` fields
-* `backend/app/services/validation_service.py` - Modified `validate_ground_truth_item` to return `BulkImportError` objects with index tracking; updated function signature to accept `item_index` parameter
-* `backend/app/services/validation_service.py` - Modified `validate_bulk_items` to pass item index to validator and return structured errors
-* `backend/app/api/v1/ground_truths.py` - Updated `import_bulk` endpoint to convert repository errors to structured format and build validation summary
-* `backend/tests/unit/test_bulk_import_tag_validation.py` - Updated test assertions to validate structured error objects (code, field, item_id, index, message)
-
-### Removed
-
-* None
-
-## Release Summary
-
-**Total files affected**: 4 files modified
-
-**API Changes**:
-- `ImportBulkResponse` now includes:
-  - `failed` (int): count of failed items
-  - `errors` (list[BulkImportError]): structured error objects instead of strings
-  - `validationSummary`: statistics with total/succeeded/failed counts
-- `BulkImportError` structure:
-  - `index` (int): 0-based position in request array
-  - `itemId` (str | null): ID of failed item
-  - `field` (str | null): field that caused error
-  - `code` (str): error code (INVALID_TAG, DUPLICATE_ID, CREATE_FAILED)
-  - `message` (str): human-readable description
-
-**Error Codes**:
-- `INVALID_TAG`: Tag doesn't exist in registry or violates format
-- `DUPLICATE_ID`: Item with this ID already exists (Cosmos 409)
-- `CREATE_FAILED`: Generic persistence failure
-
-**Testing**: 
-- All 11 bulk-related unit tests passing
-- Tag validation tests updated for structured error format
-- DoS prevention tests passing
-- Type checking passes with acceptable warnings
-
-**Backward Compatibility**:
-- Response structure changed but maintains same HTTP status codes
-- Clients expecting string errors will need to update to use structured objects
-- All other fields (imported, uuids, piiWarnings) remain unchanged
-
-**Deployment Notes**: 
-- No database migrations required
-- No configuration changes required
-- Backend-only changes
-- Clients consuming bulk import API should update error handling logic
diff --git a/.copilot-tracking/changes/20260123-explorer-state-preservation-changes.md b/.copilot-tracking/changes/20260123-explorer-state-preservation-changes.md
deleted file mode 100644
index cafc42b..0000000
--- a/.copilot-tracking/changes/20260123-explorer-state-preservation-changes.md
+++ /dev/null
@@ -1,102 +0,0 @@
-<!-- markdownlint-disable-file -->
-# Release Changes: Explorer State Preservation
-
-**Related Plan**: IMPLEMENTATION_PLAN.md (Priority 2 - User Experience, [FOUNDATION])
-**Implementation Date**: 2026-01-23
-
-## Summary
-
-Implemented URL-based filter state persistence for the QuestionsExplorer component. Users can now bookmark filtered views, and filters persist across page reloads. This is a foundational feature that enables future enhancements like keyword search and tag filtering to be URL-addressable.
-
-## Changes
-
-### Added
-
-* `frontend/src/types/filters.ts` - Centralized filter type definitions (FilterState, FilterType, SortColumn, SortDirection)
-* `frontend/src/utils/filterUrlParams.ts` - URL parameter management utilities:
-  - `parseFilterStateFromUrl()` - Parse URL search params → FilterState
-  - `filterStateToUrlParams()` - Convert FilterState → URLSearchParams
-  - `updateUrlWithoutReload()` - Update browser URL via History API without reload
-  - `getCurrentSearch()` - Get current search parameters
-* `frontend/tests/unit/utils/filterUrlParams.test.ts` - 31 comprehensive tests covering:
-  - URL parsing (default values, valid parameters, invalid parameters)
-  - Filter state to URL conversion
-  - URL updates without page reload
-  - Edge cases (empty tags, special characters, missing params)
-
-### Modified
-
-* `frontend/src/components/app/QuestionsExplorer.tsx` - Integrated URL state persistence:
-  - Added imports for filter utilities and types
-  - Initialize filter state from URL on component mount (useEffect with empty deps)
-  - Sync URL when appliedFilter changes (useEffect with appliedFilter dependency)
-  - Preserved all existing functionality and component behavior
-  - No breaking changes to component API
-
-### Removed
-
-* None
-
-## Release Summary
-
-**Total files affected**: 4 files (3 added, 1 modified)
-
-**User Experience Improvements**:
-- **Bookmarkable Views**: Users can save and share specific filter combinations via URL
-- **Filter Persistence**: Page reload maintains filter state (no more lost work)
-- **Clean URLs**: Only non-default parameters included in URL
-- **Type-Safe**: Full validation of all URL parameters with fallback to defaults
-
-**Technical Details**:
-- Uses browser History API (`pushState`) to update URL without page reload
-- URL parameters: `status`, `dataset`, `tags`, `itemId`, `refUrl`, `sortColumn`, `sortDirection`
-- Tag array encoding: comma-separated list (e.g., `tags=important,validated`)
-- Special character handling: proper URL encoding/decoding for refUrl and other params
-- Default values: `status=all`, `dataset=all`, `tags=[]`, etc.
-
-**Example URLs**:
-```
-Simple:     /?status=approved
-Complex:    /?status=approved&dataset=prod&tags=important,validated&sortColumn=refs
-Item ID:    /?itemId=item-123
-Reference:  /?refUrl=https%3A%2F%2Fexample.com%2Fpage
-Default:    / (no params shown)
-```
-
-**Testing**: 
-- 226 of 232 frontend tests passing (the remaining 6 are pre-existing skipped tests unrelated to this change)
-- 31 new tests for URL filter persistence utilities (all passing)
-- TypeScript build: ✅ SUCCESS
-- Vite production build: ✅ SUCCESS
-- No performance regression detected
-
-**Backward Compatibility**:
-- Fully backward compatible - URLs without parameters work as before
-- Component API unchanged - no breaking changes for consumers
-- Graceful fallback to default values for invalid URL parameters
-- Existing filter behavior preserved exactly
-
-**Architecture Notes**:
-- **Foundation Feature**: Enables future keyword search and tri-state tag filtering to be URL-addressable
-- Clean separation of concerns: filter types → utils → component integration
-- Reusable utilities for future URL state management needs
-- Comprehensive test coverage (31 tests) ensures reliability
-
-**Deployment Notes**: 
-- No database migrations required
-- No configuration changes required
-- No backend changes required
-- Frontend-only enhancement
-- Deploy with standard frontend build pipeline
-- Recommend testing bookmarked URLs after deployment
-
-**Known Limitations**:
-- URL does not include pagination state (currentPage, itemsPerPage) - by design, filters are more important to preserve
-- Very long tag lists may make URLs unwieldy (mitigated by comma encoding)
-- Browser history will contain filter changes (user can use back/forward to navigate filter history)
-
-**Future Enhancements Enabled**:
-- Keyword search can now be added to URL (unlocked by this foundation)
-- Tri-state tag filtering can be URL-encoded (unlocked by this foundation)
-- Analytics tracking of popular filter combinations via URL analysis
-- Deep linking into specific views from external tools/dashboards
diff --git a/.copilot-tracking/changes/20260123-has-answer-sort-documentation-changes.md b/.copilot-tracking/changes/20260123-has-answer-sort-documentation-changes.md
deleted file mode 100644
index c4e3a1e..0000000
--- a/.copilot-tracking/changes/20260123-has-answer-sort-documentation-changes.md
+++ /dev/null
@@ -1,60 +0,0 @@
-<!-- markdownlint-disable-file -->
-# Release Changes: has_answer Sort Field Documentation
-
-**Related Plan**: IMPLEMENTATION_PLAN.md (Code Quality)
-**Implementation Date**: 2026-01-23
-
-## Summary
-
-Resolved the TODO comment in `cosmos_repo.py` by documenting the design rationale for the `has_answer` sort field mapping. The TODO suggested revisiting why `has_answer` maps to `c.reviewedAt` in Cosmos DB queries. Investigation confirmed that this is the correct implementation given Cosmos DB limitations.
-
-The changes add comprehensive documentation explaining:
-1. Why `has_answer` uses `c.reviewedAt` as a placeholder in the ORDER BY clause
-2. That actual sorting happens in-memory where `has_answer` is computed as a boolean
-3. This design works around Cosmos DB's inability to sort by computed/derived fields
-
-## Changes
-
-### Modified
-
-* `backend/app/adapters/repos/cosmos_repo.py` - Replaced TODO with detailed documentation
-  * Line ~760: Added multi-line comment explaining the `has_answer` mapping rationale in `_build_secure_sort_clause`
-  * Line ~700: Added cross-reference comment in `_sort_key` method explaining in-memory sort implementation
-  * Both comments clarify that this is a deliberate design decision, not a bug or incomplete implementation
-
-## Technical Details
-
-**The Design Pattern:**
-
-Cosmos DB SQL doesn't support sorting by computed expressions like:
-```sql
-ORDER BY (c.answer IS NOT NULL AND LENGTH(c.answer) > 0)
-```
-
-**The Solution:**
-
-1. **Cosmos Query Level**: Use `c.reviewedAt` as a syntactically valid placeholder in ORDER BY
-2. **Python Level**: Perform actual sorting in `_sort_key` method using:
-   - Primary sort key: `has_answer` (1 if answer exists and non-empty, else 0)
-   - Secondary sort key: `reviewed_at` (or `updated_at` fallback)
-   - Tertiary sort key: `id` (for stable sorting)
-
-**Why This Works:**
-
-- Cosmos DB requires a valid ORDER BY clause for pagination/consistency
-- The placeholder doesn't affect correctness because Python re-sorts the results
-- This pattern is consistent with how `tag_count` sorting works (also in-memory)
-
-## Testing
-
-* All 26 cosmos_repo unit tests pass
-* Type checking clean with `ty check`
-* No functional changes, only documentation improvements
-
-## Release Summary
-
-Resolved code clarity issue by documenting existing correct implementation. No behavior changes.
-
-**Files affected**: 1 file modified
-**Tests**: All 267 backend unit tests passing
-**Type checking**: Zero errors
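Reviewer note on the change log above: the three-level in-memory sort it describes can be sketched roughly as follows. This is a minimal illustration, not the actual `_sort_key` from `cosmos_repo.py`. The dict-shaped items, the `sort_by_has_answer` helper, and the `datetime.min` fallback for items with no timestamp are assumptions made for the example.

```python
from datetime import datetime
from typing import Any, Dict, List, Tuple

# Assumption of this sketch: items with neither timestamp sort as "oldest".
_NO_TIMESTAMP = datetime.min


def sort_key(item: Dict[str, Any]) -> Tuple[int, datetime, str]:
    """Build the (has_answer, timestamp, id) key described in the change log."""
    answer = item.get("answer")
    # Primary key: 1 if an answer exists and is non-empty, else 0.
    has_answer = 1 if answer else 0
    # Secondary key: reviewed_at, falling back to updated_at.
    timestamp = item.get("reviewed_at") or item.get("updated_at") or _NO_TIMESTAMP
    # Tertiary key: id, so ties resolve deterministically.
    return (has_answer, timestamp, item["id"])


def sort_by_has_answer(
    items: List[Dict[str, Any]], descending: bool = False
) -> List[Dict[str, Any]]:
    """Re-sort results in memory, ignoring whatever order the query returned."""
    return sorted(items, key=sort_key, reverse=descending)
```

Because the re-sort ignores the order the query returned, the `c.reviewedAt` placeholder in ORDER BY has no effect on the final ordering, which matches the rationale given in the log.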
diff --git a/.copilot-tracking/changes/20260123-implementation-plan-summary-update.md b/.copilot-tracking/changes/20260123-implementation-plan-summary-update.md
deleted file mode 100644
index f45d6c4..0000000
--- a/.copilot-tracking/changes/20260123-implementation-plan-summary-update.md
+++ /dev/null
@@ -1,33 +0,0 @@
-<!-- markdownlint-disable-file -->
-# Release Changes: Implementation Plan Summary Update
-
-**Related Plan**: IMPLEMENTATION_PLAN.md
-**Implementation Date**: 2026-01-23
-
-## Summary
-
-Updated the IMPLEMENTATION_PLAN.md summary section to accurately reflect the completion status of all priority 0-2 features. Replaced the outdated "Suggested Implementation Sequence" section with a comprehensive "Implementation Status Summary" that clearly shows all completed features and the few remaining optional items.
-
-## Changes
-
-### Modified
-
-* `IMPLEMENTATION_PLAN.md` - Replaced lines 520-568 with new "Implementation Status Summary" section:
-  - Listed all completed features by priority (Security, Data Integrity, UX, Technical Debt, Documentation, Performance)
-  - Updated test counts: 267 backend unit tests, 138 integration tests, 237 frontend tests
-  - Clarified that only 3 optional items remain (pre-commit hooks, CI enhancements, production profiling)
-  - Removed outdated notes about incomplete features
-  - Added "Ready for Production" status indicator
-
-## Release Summary
-
-**Files Modified**: 1
-**Documentation Status**: Implementation plan now accurately reflects 100% completion of critical features
-
-## Deployment Notes
-
-This is a documentation-only change with no code or functional changes. The implementation plan now serves as an accurate record of completed work rather than a todo list.
-
-## Learnings
-
-When an implementation plan grows large and most items are complete, the summary section becomes stale quickly. Periodic cleanup keeps the plan useful and accurate for future reference.
diff --git a/.copilot-tracking/changes/20260123-pre-commit-hooks-changes.md b/.copilot-tracking/changes/20260123-pre-commit-hooks-changes.md
deleted file mode 100644
index fcb3de2..0000000
--- a/.copilot-tracking/changes/20260123-pre-commit-hooks-changes.md
+++ /dev/null
@@ -1,88 +0,0 @@
-<!-- markdownlint-disable-file -->
-# Release Changes: Pre-Commit Hooks Implementation
-
-**Related Plan**: IMPLEMENTATION_PLAN.md (Priority 3)
-**Implementation Date**: 2026-01-23
-
-## Summary
-
-Implemented pre-commit quality checks for the frontend codebase using npm scripts. Since this repository uses Jujutsu (jj) for version control, which lacks native hook support, the solution uses npm scripts that can be run manually or integrated into CI pipelines.
-
-The implementation adds automated linting and type checking to catch issues before commits, improving code quality and consistency. Fixed 36 existing code formatting issues and 6 React hooks/accessibility issues during implementation.
-
-## Changes
-
-### Added
-
-* `frontend/docs/PRE_COMMIT_HOOKS.md` - Comprehensive documentation for pre-commit checks, including manual setup, optional git hooks, and usage examples
-
-### Modified
-
-* `frontend/package.json` - Added `pre-commit` and `lint:check` scripts
-  * `pre-commit`: Combines `lint:check` and `typecheck` for comprehensive validation
-  * `lint:check`: Non-writing lint check for CI/automation (runs `biome check` without `--write`)
-
-* `frontend/src/components/app/editor/TurnReferencesModal.tsx` - Fixed React hooks order violation
-  * Moved `useMemo` hooks before early return statement
-  * Complies with React Rules of Hooks
-
-* `frontend/src/components/modals/TagGlossaryModal.tsx` - Fixed accessibility issues
-  * Added proper `htmlFor` attributes to form labels
-  * Added keyboard event handler for modal backdrop
-  * Added `aria-hidden` attribute for non-interactive backdrop
-
-* 36 files auto-fixed by Biome:
-  * Import statement organization
-  * Code formatting (indentation, line breaks)
-  * Template literal usage
-  * Fragment simplification
-
-## Testing
-
-* All 237 frontend tests passing
-* TypeScript build succeeds with no errors
-* Pre-commit script successfully validates code quality
-
-## Usage
-
-Run pre-commit checks manually:
-
-```bash
-cd frontend
-npm run pre-commit
-```
-
-Integrate into CI pipeline (already supported):
-
-```bash
-npm run lint:check  # Non-writing check
-npm run typecheck   # Type validation
-```
-
-## Technical Details
-
-**Why npm scripts instead of git hooks:**
-- Jujutsu (jj) version control system lacks native hook support as of 2026-01
-- npm scripts work consistently across all version control systems
-- Easier to maintain and understand than custom hook scripts
-- Can be integrated into CI/CD pipelines
-
-**Biome Configuration:**
-- Uses `@biomejs/biome` 2.1.4 for linting
-- Auto-fixes safe formatting issues
-- Enforces React hooks rules and accessibility standards
-
-## Release Summary
-
-Completed optional Priority 3 enhancement for frontend code quality. Implementation adds automated pre-commit validation while maintaining compatibility with the project's Jujutsu-based version control workflow.
-
-**Files affected**: 40 files (1 added, 39 modified)
-**Tests**: All 237 frontend tests passing
-**Type checking**: Zero errors
-
-## Future Enhancements
-
-Documented in `frontend/docs/PRE_COMMIT_HOOKS.md`:
-1. Optional git hook installation for git command users
-2. Optional Husky integration for robust hook management
-3. Potential CI/CD integration for automated quality gates
diff --git a/.copilot-tracking/changes/20260123-pydantic-alias-type-fix-changes.md b/.copilot-tracking/changes/20260123-pydantic-alias-type-fix-changes.md
deleted file mode 100644
index b23b1c7..0000000
--- a/.copilot-tracking/changes/20260123-pydantic-alias-type-fix-changes.md
+++ /dev/null
@@ -1,26 +0,0 @@
-<!-- markdownlint-disable-file -->
-# Release Changes: Pydantic Alias Type Checker Fix
-
-**Related Plan**: IMPLEMENTATION_PLAN.md (CI Code Quality Gates)
-**Implementation Date**: 2026-01-23
-
-## Summary
-
-Fixed type checker warnings for Pydantic v2 models using field aliases. When using `alias` parameter with `populate_by_name=True`, the type checker requires using the alias names (camelCase) when instantiating models, not the field names (snake_case).
-
-## Changes
-
-### Modified
-
-* `backend/app/services/duplicate_detection_service.py` - Changed DuplicateWarning instantiation to use camelCase alias names (itemId, duplicateId, duplicateQuestion, duplicateStatus, matchReason) instead of snake_case field names
-* `backend/app/api/v1/ground_truths.py` - Changed ImportBulkResponse instantiation to use camelCase alias names (piiWarnings, duplicateWarnings, validationSummary) instead of snake_case field names
-
-## Release Summary
-
-**Type Checker Status**: All checks passed (0 diagnostics)
-**Test Results**: All 267 backend unit tests pass
-**Files Modified**: 2
-
-## Deployment Notes
-
-This change fixes type checker warnings without changing runtime behavior or API contracts. The models continue to accept both snake_case and camelCase field names during validation due to `populate_by_name=True`, but the type checker requires using the alias names for proper static analysis.
diff --git a/.copilot-tracking/changes/20260123-tag-definitions-storage-changes.md b/.copilot-tracking/changes/20260123-tag-definitions-storage-changes.md
deleted file mode 100644
index 2732233..0000000
--- a/.copilot-tracking/changes/20260123-tag-definitions-storage-changes.md
+++ /dev/null
@@ -1,67 +0,0 @@
-<!-- markdownlint-disable-file -->
-# Release Changes: Custom Tag Definitions Storage (TG-04)
-
-**Related Plan**: IMPLEMENTATION_PLAN.md (Tag Glossary - TG-04)
-**Implementation Date**: 2026-01-23
-
-## Summary
-
-Implemented database storage for SME-created custom tag definitions, enabling users to define and persist custom tags with descriptions that appear in the tag glossary alongside system-defined manual and computed tags. This completes the backend foundation for TG-04, with frontend UI (TG-06) deferred to a future increment.
-
-## Changes
-
-### Added
-
-* `backend/app/adapters/repos/tag_definitions_repo.py` - Repository adapter for Cosmos DB tag definitions storage with CRUD operations (get_definition, list_all, upsert, delete)
-* `backend/tests/unit/test_tag_definitions_repo.py` - Unit tests for TagDefinitionsRepo (7 tests covering CRUD operations and error cases)
-* `COSMOS_CONTAINER_TAG_DEFINITIONS` config in `backend/app/core/config.py` - Container name constant (default: "tag_definitions")
-* `TagDefinition` domain model in `backend/app/domain/models.py` - Fields: id, tag_key (partition key), description, created_by, created_at, updated_at, doc_type
-* API endpoint `POST /v1/tags/definitions` in `backend/app/api/v1/tags.py` - Create or update custom tag definition
-* API endpoint `DELETE /v1/tags/definitions/{tag_key}` in `backend/app/api/v1/tags.py` - Delete custom tag definition
-* Request/response models `TagDefinitionRequest` and `TagDefinitionResponse` in `backend/app/api/v1/tags.py`
-
-### Modified
-
-* `backend/app/container.py` - Wire tag_definitions_repo in container initialization and validation
-* `backend/app/api/v1/tags.py` - Extended glossary endpoint to query custom definitions and merge as "custom" type group
-* `backend/scripts/cosmos_container_manager.py` - Added --tag-definitions-container flag for container creation (partition key: /tag_key, Hash)
-* `backend/tests/unit/test_tags_glossary.py` - Added mock for tag_definitions_repo and test for custom definitions in glossary response (4 tests total)
-* `IMPLEMENTATION_PLAN.md` - Marked TG-04 complete, updated with implementation details
-
-### Removed
-
-* None
-
-## Release Summary
-
-**Total Files Affected**: 9 files (639 lines added)
-- 2 new files (repository adapter + tests)
-- 7 modified files (config, domain model, container, API, script, test, plan)
-- 0 removed files
-
-**Test Coverage**: 
-- 267 backend unit tests pass (8 new tests: 7 for TagDefinitionsRepo, 1 for glossary endpoint)
-- All type checks pass (ty check)
-
-**Deployment Notes**:
-- New Cosmos DB container `tag_definitions` must be created using:
-  ```bash
-  uv run python scripts/cosmos_container_manager.py \
-    --endpoint <endpoint> \
-    --key <key> \
-    --db <database> \
-    --tag-definitions-container
-  ```
-- Container uses partition key `/tag_key` with Hash partitioning
-- No frontend changes required (glossary will display empty custom group if no definitions exist)
-- TG-06 (inline editing UI) deferred to future increment
-
-**API Changes**:
-- `GET /v1/tags/glossary` now includes "custom" group with custom tag definitions from database
-- New endpoints: `POST /v1/tags/definitions`, `DELETE /v1/tags/definitions/{tag_key}`
-- Authentication for management endpoints uses default "system" user_id (full auth deferred to TG-06)
-
-**Backward Compatibility**: 
-- Fully backward compatible
-- Glossary endpoint returns empty custom group if tag_definitions container doesn't exist
-- No breaking changes to existing API contracts
diff --git a/.copilot-tracking/changes/20260123-tag-glossary-changes.md b/.copilot-tracking/changes/20260123-tag-glossary-changes.md
deleted file mode 100644
index 44b9720..0000000
--- a/.copilot-tracking/changes/20260123-tag-glossary-changes.md
+++ /dev/null
@@ -1,72 +0,0 @@
-<!-- markdownlint-disable-file -->
-# Release Changes: Tag Glossary Implementation
-
-**Related Plan**: N/A (standalone feature)
-**Implementation Date**: 2026-01-23
-
-## Summary
-
-Implemented tag glossary system to provide human-readable descriptions for all tags via tooltip UI. Users can now hover over any TagChip to see a definition, improving tag understanding and consistency.
-
-## Changes
-
-### Added
-
-* `backend/app/api/v1/tags.py` - Added `/v1/tags/glossary` endpoint returning comprehensive tag definitions from manual and computed sources
-* `backend/tests/unit/test_tags_glossary.py` - Unit tests for glossary endpoint (3 tests covering manual tags, computed tags, and response structure)
-* `frontend/src/hooks/useTagGlossary.ts` - React hook to fetch and cache tag glossary data, providing lookup map of tag key -> description
-* Tag definitions to `backend/app/domain/manual_tags.json` - Extended schema to include group descriptions and per-tag descriptions for all 6 tag groups (source, answerability, topic, intent, expertise, difficulty)
-
-### Modified
-
-* `backend/app/domain/manual_tags_provider.py` - Extended `ManualTagGroup` dataclass to support optional `description` and `tag_definitions` fields; updated parser to handle both old (string list) and new (object list) tag formats
-* `backend/tests/unit/test_manual_tags_provider.py` - Updated test assertions to match new data structure with `tag_definitions` field
-* `frontend/src/components/common/TagChip.tsx` - Enhanced component to fetch and display tag descriptions via CSS tooltip on hover (using `useTagDescription` hook)
-* `frontend/src/api/openapi.json` - Regenerated OpenAPI spec to include glossary endpoint schema
-* `frontend/src/api/generated.ts` - Regenerated TypeScript types for glossary response models
-
-### Removed
-
-* None
-
-## Technical Details
-
-### Backend Implementation
-
-- **Backward compatibility**: Manual tags JSON parser accepts both old format (`tags: ["value"]`) and new format (`tags: [{"value": "...", "description": "..."}]`)
-- **Data model**: Extended `ManualTagGroup` with optional `description` (group-level) and `tag_definitions` (tag-level) fields
-- **API response**: Glossary endpoint merges manual tags from config file and computed tags from plugin registry into unified response
-- **Computed tags**: Phase 1 includes computed tags in glossary without descriptions (descriptions deferred to future phase)
-
-### Frontend Implementation
-
-- **No dependencies added**: Used native CSS tooltips instead of Radix UI to avoid adding dependencies
-- **Plain React state**: Implemented `useTagGlossary` hook with useState/useEffect instead of React Query (not installed)
-- **Tooltip UX**: Tooltips appear on hover with 200ms transition, positioned above tag with arrow pointer
-- **Fallback behavior**: Tags without definitions show no tooltip (graceful degradation)
-
-## Test Coverage
-
-- **Backend**: 3 new unit tests for glossary endpoint, all 256 backend unit tests passing
-- **Frontend**: All 226 frontend tests passing, build succeeds
-
-## Manual Testing Required
-
-1. Start dev server: `cd backend && uv run uvicorn app.main:app --reload`
-2. Start frontend: `cd frontend && npm run dev`
-3. Navigate to Questions Explorer
-4. Hover over tags to verify tooltips appear with descriptions
-5. Verify tooltips for manual tags (e.g., "source:sme") show descriptions
-6. Verify computed tags show generic "no description" behavior or empty tooltip
-
-## Release Summary
-
-**Files Added**: 2
-**Files Modified**: 7
-**Files Removed**: 0
-
-**Deployment Notes**:
-- No database migrations required
-- No environment variable changes needed
-- Backend API is backward compatible (glossary endpoint is additive)
-- Frontend gracefully handles missing glossary data
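Reviewer note on the change log above: the backward-compatible parsing it describes (old string-list tags and new object-list tags) might look something like this. It is a hedged sketch only. `TagGroup` is a hypothetical stand-in for `ManualTagGroup`, and the `group` key name is an assumption; only the `tags`, `value`, and `description` shapes appear in the log.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class TagGroup:
    """Hypothetical stand-in for ManualTagGroup (field names partly assumed)."""

    name: str
    tags: List[str]
    description: Optional[str] = None
    tag_definitions: Dict[str, str] = field(default_factory=dict)


def parse_group(raw: Dict[str, Any]) -> TagGroup:
    """Accept both the old string-list and the new object-list tag formats."""
    tags: List[str] = []
    definitions: Dict[str, str] = {}
    for entry in raw.get("tags", []):
        if isinstance(entry, str):
            # Old format: tags: ["value"]
            tags.append(entry)
        else:
            # New format: tags: [{"value": "...", "description": "..."}]
            value = entry["value"]
            tags.append(value)
            if entry.get("description"):
                definitions[value] = entry["description"]
    return TagGroup(
        name=raw["group"],  # "group" is an assumed key name
        tags=tags,
        description=raw.get("description"),
        tag_definitions=definitions,
    )
```

Keeping `tags` as a plain list of strings in both cases means callers that only need tag values do not have to know which format the file used.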
diff --git a/.copilot-tracking/changes/20260123-tag-glossary-inline-editing-changes.md b/.copilot-tracking/changes/20260123-tag-glossary-inline-editing-changes.md
deleted file mode 100644
index c69483a..0000000
--- a/.copilot-tracking/changes/20260123-tag-glossary-inline-editing-changes.md
+++ /dev/null
@@ -1,85 +0,0 @@
-<!-- markdownlint-disable-file -->
-# Release Changes: Tag Glossary Inline Editing (TG-06)
-
-**Related Plan**: IMPLEMENTATION_PLAN.md (Priority 4: Documentation > Tag Glossary)
-**Implementation Date**: 2026-01-23
-
-## Summary
-
-Added inline editing capabilities for custom tag definitions in the TagGlossaryModal. SMEs can now create, edit, and delete custom tag definitions directly from the glossary UI without needing to interact with the backend API manually.
-
-## Changes
-
-### Added
-
-* `frontend/src/services/tags.ts` - Added `createTagDefinition()` and `deleteTagDefinition()` API client functions
-* `.copilot-tracking/changes/20260123-tag-glossary-inline-editing-changes.md` - This change log file
-
-### Modified
-
-* `frontend/src/hooks/useTagGlossary.ts` - Added `refresh()` method to GlossaryStore and exposed it in the hook return value
-* `frontend/src/components/modals/TagGlossaryModal.tsx` - Implemented complete inline editing UI:
-  - Added "New Custom Tag" button to controls section
-  - Implemented create form with tag key and description inputs
-  - Added Edit (pencil) and Delete (trash) buttons for custom tags
-  - Implemented inline editing mode for tag descriptions
-  - Added state management for editing/creating operations with loading states
-  - Integrated with refresh() to update glossary after mutations
-* `frontend/src/api/generated.ts` - Regenerated TypeScript types from updated OpenAPI spec
-* `frontend/src/api/openapi.json` - Regenerated from backend API (includes tag definitions endpoints)
-* `backend/pyproject.toml` - Updated (via export_openapi.py formatting)
-
-### Removed
-
-None
-
-## Implementation Details
-
-### UI Features
-
-1. **New Custom Tag Creation**:
-   - Button in controls section opens inline form
-   - Form validates both tag key and description required
-   - Cancel button discards changes
-   - Success refreshes glossary and closes form
-
-2. **Inline Editing**:
-   - Edit button appears only on custom tag entries
-   - Switches to inline textarea for description editing
-   - Save/Cancel buttons for inline editing mode
-   - Disabled state during async operations
-
-3. **Tag Deletion**:
-   - Delete button appears only on custom tag entries
-   - Confirmation dialog prevents accidental deletion
-   - Success refreshes glossary to reflect removal
-
-4. **Error Handling**:
-   - Alert dialogs for API errors with descriptive messages
-   - Disabled UI during async operations (submitting state)
-   - Form validation before submission
-
-### API Integration
-
-- `POST /v1/tags/definitions` - Create or update custom tag definition
-- `DELETE /v1/tags/definitions/{tag_key}` - Delete custom tag definition
-- Both endpoints use the existing backend infrastructure (TG-04)
-
-### Testing Results
-
-- All 237 frontend tests pass
-- All 267 backend unit tests pass
-- TypeScript build succeeds with no errors
-- Frontend production build succeeds
-
-## Release Summary
-
-**Files Modified**: 5
-**Files Added**: 1
-**Files Removed**: 0
-
-**Deployment Notes**: 
-- Frontend requires rebuild to include new inline editing UI
-- Backend API endpoints already exist from TG-04 implementation
-- No database migrations required
-- Feature is backward compatible with existing glossary functionality
diff --git a/.copilot-tracking/changes/20260123-tristate-tag-filtering-changes.md b/.copilot-tracking/changes/20260123-tristate-tag-filtering-changes.md
deleted file mode 100644
index bbc2235..0000000
--- a/.copilot-tracking/changes/20260123-tristate-tag-filtering-changes.md
+++ /dev/null
@@ -1,61 +0,0 @@
-<!-- markdownlint-disable-file -->
-# Release Changes: Tri-State Tag Filtering
-
-**Related Plan**: IMPLEMENTATION_PLAN.md (Priority 2: Tag Filtering Enhancement)
-**Implementation Date**: 2026-01-23
-
-## Summary
-
-Implemented tri-state tag filtering with include/exclude/neutral states, enabling users to filter items by tags they want to include AND tags they want to exclude. This replaces the binary include-only filtering with a more powerful tri-state system.
-
-## Changes
-
-### Added
-
-* `backend/app/adapters/repos/base.py` - Added `exclude_tags` parameter to repository base interface
-* `backend/app/api/v1/ground_truths.py` - Added `exclude_tags` query parameter support to GET /api/v1/ground-truths endpoint
-
-### Modified
-
-* `backend/app/adapters/repos/cosmos_repo.py` - Implemented exclude tag filtering in both SQL and in-memory paths
-* `backend/tests/unit/test_cosmos_repo.py` - Added tests for exclude tag filtering
-* `frontend/src/types/filters.ts` - Changed `TagFilter` type from `string[]` to `{ include: string[], exclude: string[] }`
-* `frontend/src/utils/filterUrlParams.ts` - Updated URL parsing/serialization to support tri-state tag structure with `excludeTags` parameter
-* `frontend/src/components/app/QuestionsExplorer.tsx` - Implemented tri-state toggle UI (Include/Exclude/Neutral) for tag filtering
-* `frontend/src/services/groundTruths.ts` - Updated API service to pass exclude_tags parameter
-* `frontend/tests/unit/utils/filterUrlParams.test.ts` - Updated tests to reflect new tri-state tag structure
-
-### Removed
-
-* None
-
-## Technical Implementation
-
-### Backend Changes
-
-1. **Repository Layer**: Added `exclude_tags` parameter to base repository interface and Cosmos implementation
-2. **Query Logic**: 
-   - SQL path: Uses `NOT (ARRAY_CONTAINS(c.manualTags, @excludeTag) OR ARRAY_CONTAINS(c.computedTags, @excludeTag))` clauses
-   - In-memory path: Filters out items with ANY excluded tag using set intersection
-3. **API Layer**: Added `exclude_tags` query parameter (comma-separated list) to ground truths list endpoint
-
-### Frontend Changes
-
-1. **Type System**: Changed `TagFilter` from simple string array to object with `include` and `exclude` arrays
-2. **URL State**: Added `excludeTags` URL parameter for bookmarkable exclude filters
-3. **UI**: Implemented tri-state toggle buttons showing Include (green) / Exclude (red) / Neutral (gray) states
-4. **Behavior**: Clicking cycles through states: Neutral → Include → Exclude → Neutral
-
-## Testing
-
-- **Backend**: All 236 unit tests passing, including new exclude tag filter tests
-- **Frontend**: All 226 tests passing, updated filter URL parsing tests for tri-state structure
-- **Integration**: Verified exclude tags work with keyword search, status filters, and URL persistence
-
-## Release Summary
-
-**Files Changed**: 9 files (8 modified, 1 test file)
-**Lines Changed**: ~200 lines added/modified
-**Test Coverage**: Comprehensive unit tests for both backend and frontend
-
-This feature enables more precise filtering in the explorer view, allowing users to find items that have certain tags while excluding items with other tags. The tri-state UI provides clear visual feedback and the URL persistence makes filtered views bookmarkable.
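Reviewer note on the change log above: the state cycle (Neutral → Include → Exclude → Neutral) and the in-memory exclude path (drop items carrying ANY excluded tag, via set intersection) can be illustrated as below. This is a sketch under assumptions, not the repository code. `TagState`, `next_state`, and `drop_excluded` are hypothetical names, and items are assumed to be dicts using the `manualTags` and `computedTags` field names seen in the SQL clause.

```python
from enum import Enum
from typing import Any, Dict, Iterable, List


class TagState(Enum):
    NEUTRAL = "neutral"
    INCLUDE = "include"
    EXCLUDE = "exclude"


# Click order described in the change log.
_NEXT_STATE = {
    TagState.NEUTRAL: TagState.INCLUDE,
    TagState.INCLUDE: TagState.EXCLUDE,
    TagState.EXCLUDE: TagState.NEUTRAL,
}


def next_state(state: TagState) -> TagState:
    """Return the state a tag toggle moves to on the next click."""
    return _NEXT_STATE[state]


def drop_excluded(
    items: Iterable[Dict[str, Any]], exclude_tags: Iterable[str]
) -> List[Dict[str, Any]]:
    """Keep only items that carry none of the excluded tags."""
    excluded = set(exclude_tags)
    kept = []
    for item in items:
        tags = set(item.get("manualTags") or []) | set(item.get("computedTags") or [])
        # A non-empty intersection means the item has at least one excluded tag.
        if not (tags & excluded):
            kept.append(item)
    return kept
```

The include side is left out on purpose: the log does not say whether included tags are combined with AND or OR, so the sketch does not guess.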
diff --git a/.copilot-tracking/changes/20260123-type-checker-fixes-changes.md b/.copilot-tracking/changes/20260123-type-checker-fixes-changes.md
deleted file mode 100644
index b5db47e..0000000
--- a/.copilot-tracking/changes/20260123-type-checker-fixes-changes.md
+++ /dev/null
@@ -1,70 +0,0 @@
-<!-- markdownlint-disable-file -->
-# Release Changes: Type Checker Error Fixes
-
-**Related Plan**: IMPLEMENTATION_PLAN.md (Priority 3: Technical Debt & Code Quality)
-**Implementation Date**: 2026-01-23
-
-## Summary
-
-Resolved type checker errors discovered when running `uv run ty check app/` on the backend codebase. Fixed incorrect method call signatures and added proper type ignore annotations for Pydantic v2 aliasing patterns.
-
-## Changes
-
-### Modified
-
-* `backend/app/adapters/repos/cosmos_repo.py` - Fixed `_build_query_filter` method calls:
-  - Line 1083-1089: Added missing `exclude_tags` parameter (None) in emulator path
-  - Line 1115-1121: Added missing `exclude_tags` parameter (None) in stats count path
-  - Line 1150-1156: Added missing `exclude_tags` parameter (None) in base count path
-* `backend/app/api/v1/ground_truths.py` - Fixed duplicate detection query:
-  - Line 202: Changed `status=[GroundTruthStatus.approved]` to `status=GroundTruthStatus.approved` (method expects single value, not list)
-  - Line 231-239: Added `# type: ignore[call-arg,misc]` for Pydantic v2 aliasing (populate_by_name pattern)
-* `backend/app/core/rate_limiter.py` - Added type ignore for FastAPI exception handler:
-  - Line 69: Added `# type: ignore[arg-type]` for exception handler signature (type checker limitation with FastAPI's ExceptionHandler type)
-* `backend/app/services/duplicate_detection_service.py` - Added type ignore for Pydantic aliasing:
-  - Line 114: Added `# type: ignore[call-arg]` for DuplicateWarning constructor (populate_by_name pattern)
-* `.copilot-tracking/changes/20260123-type-checker-fixes-changes.md` - This change log file
-
-### Removed
-
-None
-
-### Added
-
-None
-
-## Implementation Details
-
-### Type Checker Errors Fixed
-
-1. **Missing `exclude_tags` parameter**: The `_build_query_filter` method signature was updated in a previous commit to include `exclude_tags` parameter, but three call sites were not updated. All three paths (emulator, stats count, base count) pass `None` for this parameter as tag exclusion is not needed in those contexts.
-
-2. **List vs single value for status**: The `list_gt_paginated` method accepts a single `GroundTruthStatus` value, not a list. Fixed the duplicate detection query to pass the enum directly.
-
-3. **Pydantic v2 aliasing**: Pydantic v2's `populate_by_name=True` configuration allows using both Python field names and aliases in constructors. However, the type checker doesn't understand this pattern, so we added targeted `type: ignore` comments. This is a known limitation and the runtime behavior is correct.
-
-4. **FastAPI exception handler signature**: FastAPI's `add_exception_handler` has complex union types that the type checker interprets strictly. Added `type: ignore[arg-type]` as the runtime signature is correct but doesn't match the type checker's expectations.
-
-### Testing Results
-
-- All 267 backend unit tests pass
-- All 237 frontend tests pass
-- `uv run ty check app/` shows 0 errors (only warnings about Pydantic aliasing)
-
-## Release Summary
-
-**Files Modified**: 4
-**Files Added**: 1
-**Files Removed**: 0
-
-**Deployment Notes**: 
-- No functional changes, only type annotations
-- No database migrations required
-- No API changes
-- Changes improve type safety and catch potential bugs during development
-
-## Learnings
-
-- Always run `uv run ty check app/` after making changes to catch type errors early
-- Pydantic v2's `populate_by_name=True` requires `type: ignore` comments for the type checker
-- The `_build_query_filter` method signature should be checked at all call sites when modifying parameters
diff --git a/.copilot-tracking/changes/20260123-xss-sanitization-changes.md b/.copilot-tracking/changes/20260123-xss-sanitization-changes.md
deleted file mode 100644
index 1865d0e..0000000
--- a/.copilot-tracking/changes/20260123-xss-sanitization-changes.md
+++ /dev/null
@@ -1,49 +0,0 @@
-<!-- markdownlint-disable-file -->
-# Release Changes: XSS Sanitization
-
-**Related Plan**: IMPLEMENTATION_PLAN.md (Priority 0 - Security)
-**Implementation Date**: 2026-01-23
-
-## Summary
-
-Extracted URL validation logic to a shared utility module and applied it consistently across all reference URL handlers in the frontend. This ensures that malicious URL schemes (javascript:, data:, vbscript:, etc.) are blocked before opening, protecting users from XSS attacks even if backend data is compromised. Also updated all external link rel attributes to use "noopener noreferrer" for complete protection against tabnapping attacks.
-
-## Changes
-
-### Added
-
-* `frontend/src/utils/urlValidation.ts` - New shared utility module containing `validateReferenceUrl` function that blocks unsafe URL protocols and malicious patterns
-
-### Modified
-
-* `frontend/src/components/modals/InspectItemModal.tsx` - Replaced inline `validateReferenceUrl` function with import from shared utility
-* `frontend/src/demo.tsx` - Added URL validation to `onOpenRef` function before opening references; added error toast for invalid URLs
-* `frontend/src/components/app/editor/TurnReferencesModal.tsx` - Updated 2 anchor tags to use `rel="noopener noreferrer"` instead of just `rel="noreferrer"`
-* `frontend/src/components/app/ReferencesPanel/SelectedTab.tsx` - Updated anchor tag to use `rel="noopener noreferrer"`
-* `frontend/src/components/app/InstructionsPane.tsx` - Updated anchor tag to use `rel="noopener noreferrer"`
-* `frontend/src/components/common/MarkdownRenderer.tsx` - Updated anchor tags in both `mdComponents` and `compactComponents` to use `rel="noopener noreferrer"`
-* `IMPLEMENTATION_PLAN.md` - Marked XSS Sanitization task as ✅ IMPLEMENTED with implementation details
-
-### Removed
-
-* None
-
-## Release Summary
-
-**Total files affected**: 8 files (1 added, 7 modified)
-
-**Security Impact**: 
-- All reference URL handlers now validate URLs before opening
-- Blocked schemes: javascript:, data:, vbscript:, about:, blob:
-- All external links now protected against tabnapping with "noopener noreferrer"
-- User-facing error message when attempting to open invalid/unsafe URLs
-
-**Testing**: 
-- All 195 frontend unit tests passing
-- TypeScript compilation successful
-- No breaking changes to existing functionality
-
-**Deployment Notes**: 
-- No database migrations required
-- No configuration changes required
-- Frontend changes only - no backend impact
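Reviewer note on the change log above: the scheme blocklist it lists can be sketched as follows. The real `validateReferenceUrl` lives in the TypeScript frontend and is not shown in this diff, so this is a Python illustration of the idea only. `is_blocked_url` is a hypothetical name, and stripping whitespace and control characters before reading the scheme is an assumption about how obfuscated inputs such as `java\tscript:` would need to be handled.

```python
# Schemes named in the change log's "Security Impact" section.
BLOCKED_SCHEMES = {"javascript", "data", "vbscript", "about", "blob"}


def is_blocked_url(url: str) -> bool:
    """Return True if the URL's scheme is on the blocklist."""
    # Drop spaces and control characters so "  java\tscript:..." is still caught.
    cleaned = "".join(ch for ch in url if ch > " " and ch != "\x7f")
    if ":" not in cleaned:
        # No scheme at all, for example a relative path.
        return False
    scheme = cleaned.split(":", 1)[0].lower()
    return scheme in BLOCKED_SCHEMES
```

A blocklist like this only rejects the schemes it names. An allowlist of `http` and `https` would be the stricter alternative, though the log does not say which approach the frontend utility takes beyond the listed schemes.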
diff --git a/.copilot-tracking/details/20260116-export-pipeline-design-details.md b/.copilot-tracking/details/20260116-export-pipeline-design-details.md
deleted file mode 100644
index e9ae0b4..0000000
--- a/.copilot-tracking/details/20260116-export-pipeline-design-details.md
+++ /dev/null
@@ -1,353 +0,0 @@
----
-description: Implementation details for export pipeline design in Ground Truth Curator
-ms.date: 2026-01-16
----
-<!-- markdownlint-disable-file -->
-
-# Task Details: Export Pipeline Design
-
-## Research Reference
-
-Source research: `.copilot-tracking/research/20260116-export-pipeline-design-research.md`
-
-## Phase 1: Confirm export requirements and compatibility targets
-
-### Task 1.1: Document the current export behavior baseline
-
-Capture the observable behaviors that must remain stable:
-
-- `POST /v1/ground-truths/snapshot` writes per-item JSON artifacts plus `manifest.json` under `exports/snapshots/{ts}/`
-- `GET /v1/ground-truths/snapshot` returns a downloadable attachment with `Content-Disposition`
-- Frontend download behavior depends on `Content-Disposition` parsing
-
-Files:
-
-- `backend/app/services/snapshot_service.py`
-- `backend/app/api/v1/ground_truths.py`
-- `frontend/src/services/groundTruths.ts`
-
-Success:
-
-- A short “baseline contract” section exists in the design notes (what stays stable, what can change)
-- The plan identifies what the new pipeline must not break
-
-Research references:
-
-- `.copilot-tracking/research/20260116-export-pipeline-design-research.md` (Lines 31-70) - Verified baseline behaviors for snapshot write/download and frontend expectations
-
-Dependencies:
-
-- None
-
-### Task 1.2: Decide the v1 export pipeline API surface
-
-Choose a minimal, forward-compatible API for pipeline-based exports.
-
-Recommended options:
-
-- Option A: `GET /v1/exports/ground-truths` with query params for filters and format selection
-- Option B: `POST /v1/exports/ground-truths` with a request body describing filters and options
-
-Decide and document:
-
-- Supported formats for the initial milestone (at least JSON)
-- Supported filters for the initial milestone (dataset, status; tags optional)
-- Whether exports always operate on approved items or can be generalized
-
-Files:
-
-- New router file (proposed): `backend/app/api/v1/exports.py`
-- Existing snapshot routes: `backend/app/api/v1/ground_truths.py`
-
-Success:
-
-- The plan includes a clear route definition and request/response shape
-- Backward compatibility for snapshot endpoints is explicitly preserved
-
-Research references:
-
-- `.copilot-tracking/research/20260116-export-pipeline-design-research.md` (Lines 154-160) - API surface recommendations for pipeline-based exports
-
-Dependencies:
-
-- Task 1.1 completion
-
-## Phase 2: Define the export pipeline abstractions (processors, formatters, registry)
-
-### Task 2.1: Specify processor and formatter interfaces
-
-Define interfaces aligned to the existing design in `docs/computed-tags-design.md`:
-
-- Export processors: `List[dict] -> List[dict]`
-- Export formatters: `List[dict] -> bytes | str`
-
-Document required properties:
-
-- Stable `name`/`format_name` identifiers
-- Deterministic behavior requirements for tests
-- Error handling conventions (raise vs collect errors)
-
-Files:
-
-- Proposed new module(s):
-  - `backend/app/exports/processors/base.py`
-  - `backend/app/exports/formatters/base.py`
-
-Success:
-
-- Interfaces are defined in the plan with method signatures and naming rules
-- The registry approach (discover/register) is specified
-
-Research references:
-
-- `.copilot-tracking/research/20260116-export-pipeline-design-research.md` (Lines 138-153) - Proposed pipeline architecture and core concept definitions
-- `docs/computed-tags-design.md` (Export pipeline architecture section)
-
-Dependencies:
-
-- Task 1.2 completion
-
-### Task 2.2: Specify registries and configuration strategy
-
-Define:
-
-- `ExportProcessorRegistry` to register processors and prevent name collisions
-- `ExportFormatterRegistry` to register formatters and resolve requested formats
-- Configuration via environment variable(s), e.g. `GTC_EXPORT_PROCESSOR_ORDER="merge_tags,anonymize"`
-
-Decide:
-
-- How unknown processor/formatter names fail (400 with clear error)
-- Defaults when env vars are empty or unset
-
-Files:
-
-- Proposed new module(s):
-  - `backend/app/exports/registry.py`
-  - `backend/app/core/config.py` (new settings fields)
-
-Success:
-
-- Plan documents exact env var names, defaults, and validation rules
-- Plan specifies how registries are wired in `backend/app/container.py`
-
-Research references:
-
-- `.copilot-tracking/research/20260116-export-pipeline-design-research.md` (Lines 97-105) - Existing docs mention processor ordering via env var
-- `.copilot-tracking/research/20260116-export-pipeline-design-research.md` (Lines 183-191) - Naming conventions and determinism guidance for registry/config behavior
-
-Dependencies:
-
-- Task 2.1 completion
-
-## Phase 3: Define the execution flow (load, process, format, deliver)
-
-### Task 3.1: Specify export execution orchestration
-
-Design an `ExportService` (or extend `SnapshotService`) that:
-
-1. Loads items from `GroundTruthRepo` using the selected filters
-2. Converts items to export records (`model_dump(..., by_alias=True)`)
-3. Applies the configured processor chain
-4. Formats output using the selected formatter
-5. Delivers output either as:
-   - A generated artifact (FileResponse), or
-   - An in-memory attachment (JSONResponse), or
-   - A streaming response for large payloads
-
-Files:
-
-- Proposed new service: `backend/app/services/export_service.py`
-- Existing snapshot service: `backend/app/services/snapshot_service.py`
-
-Success:
-
-- The plan contains a step-by-step flow diagram (text is sufficient)
-- The plan specifies where `Content-Disposition` filename is generated
-
-Research references:
-
-- `.copilot-tracking/research/20260116-export-pipeline-design-research.md` (Lines 161-176) - Streaming and file download guidance (FileResponse/StreamingResponse)
-
-Dependencies:
-
-- Phase 2 completion
-
-### Task 3.2: Define initial processors and formatters
-
-Initial candidates (minimum viable set):
-
-- Processor: `merge_tags` (construct `tags = unique(manualTags + computedTags)` and/or enforce export union contract)
-- Formatter: `json_snapshot_payload` (preserve current `GET /snapshot` payload shape)
-- Formatter: `json_items` (export list of items only)
-
-Document:
-
-- Exact JSON shapes
-- How schemaVersion is set
-- Whether manifest is included and what fields it contains
-
-Files:
-
-- Proposed new modules under `backend/app/exports/`
-
-Success:
-
-- The plan has JSON examples for each formatter output
-- The plan explicitly preserves current snapshot payload keys used by the frontend
-
-Research references:
-
-- `.copilot-tracking/research/20260116-export-pipeline-design-research.md` (Lines 118-124) - Computed tags/export compatibility considerations
-- `.copilot-tracking/research/20260116-export-pipeline-design-research.md` (Lines 146-152) - Export record/processor/formatter contract
-
-Dependencies:
-
-- Task 3.1 completion
-
-## Phase 4: Storage targets (multi-backend) and artifact strategy
-
-### Task 4.1: Define a multi-backend export storage interface
-
-Define an `ExportStorage` (or similarly named) abstraction that the pipeline-based export endpoint will use to write artifacts.
-
-Design goals:
-
-- Support multiple backends behind a stable interface.
-- Make Azure Blob the initial concrete implementation.
-- Optionally keep a local filesystem implementation for dev/test.
-
-Decide whether:
-
-- `SnapshotStorage` is generalized to `ExportStorage` and snapshot uses it, or
-- Snapshot remains its own service, while the new export pipeline uses a separate storage abstraction
-
-If generalized, define:
-
-- Methods required beyond `write_json` (e.g., `write_bytes`, `open_read`, `list_prefix`)
-- A minimal “artifact key” strategy (prefix + timestamp + logical filename)
-
-Files:
-
-- `backend/app/adapters/storage/base.py`
-- `backend/app/adapters/storage/local_fs.py`
-
-Success:
-
-- The plan includes a clear abstraction boundary and migration steps
-- The plan identifies the minimal method set required for Blob and local FS
-
-Research references:
-
-- `.copilot-tracking/research/20260116-export-pipeline-design-research.md` (Lines 72-81) - Existing storage adapter building blocks and current bypass
-- `.copilot-tracking/research/20260116-export-pipeline-design-research.md` (Lines 177-181) - Evolution plan for integrating generalized storage (Blob-first)
-
-Dependencies:
-
-- Phase 3 completion
-
-### Task 4.2: Specify Azure Blob configuration and authentication strategy
-
-Document settings required for Blob-first storage:
-
-- Container name
-- Storage account URL (or connection string, if preferred)
-- Authentication approach:
-  - Recommended: Managed Identity / `DefaultAzureCredential` via `azure-identity`
-  - Alternative: connection string via environment variable
-
-Also document required dependency changes:
-
-- Add `azure-storage-blob` to backend runtime dependencies
-
-Files:
-
-- `backend/app/core/config.py` (new settings fields)
-- `backend/pyproject.toml` (dependency addition)
-
-Success:
-
-- Plan lists exact env var names and the auth priority order
-- Plan notes settings strictness (`extra="forbid"`) and the need to add fields explicitly
-
-Research references:
-
-- `.copilot-tracking/research/20260116-export-pipeline-design-research.md` (Lines 83-96) - Verified Blob readiness gaps + existing `azure-identity` dependency
-- `.copilot-tracking/research/20260116-export-pipeline-design-research.md` (Lines 132-136) - Additional gaps for Blob-first implementation
-
-Dependencies:
-
-- Task 4.1 completion
-
-### Task 4.3: Define delivery strategy for Blob-hosted artifacts
-
-Decide what the export endpoint returns when using Blob storage:
-
-- Option A: Backend streams content (downloads from Blob and proxies to client) while preserving `Content-Disposition`
-- Option B: Backend returns a short-lived SAS URL (client downloads directly)
-- Option C: Backend returns an export “job id” and a separate download endpoint
-
-Document:
-
-- Security expectations (who can access the artifact, TTL, auditing)
-- Frontend changes required (if any) based on chosen option
-
-Files:
-
-- Proposed router file: `backend/app/api/v1/exports.py`
-
-Success:
-
-- Plan selects one option for the initial milestone and documents the rationale
-- Plan preserves existing snapshot download behavior
-
-Research references:
-
-- `.copilot-tracking/research/20260116-export-pipeline-design-research.md` (Lines 62-70) - Frontend depends on `Content-Disposition` filename parsing
-- `.copilot-tracking/research/20260116-export-pipeline-design-research.md` (Lines 161-176) - FileResponse/StreamingResponse guidance
-
-Dependencies:
-
-- Task 4.2 completion
-
-## Phase 5: Testing, observability, and rollout
-
-### Task 5.1: Add test strategy for pipeline configuration and outputs
-
-Plan tests covering:
-
-- Registry duplicate name protections
-- Processor order configuration parsing
-- JSON output shape compatibility with existing snapshot download tests
-- Large-payload path selection (artifact vs streaming), unit-tested at minimum with a small fake dataset
-
-Files:
-
-- `backend/tests/unit/` (new tests)
-- Existing snapshot tests for reference:
-  - `backend/tests/unit/test_snapshot_service.py`
-  - `backend/tests/integration/ground_truths/test_snapshot_download_endpoint.py`
-
-Success:
-
-- Tests are identified by file and target behavior
-- The plan includes a rollout step that does not break existing endpoints
-
-Research references:
-
-- `.copilot-tracking/research/20260116-export-pipeline-design-research.md` (Lines 188-191) - Determinism guidance for stable tests
-
-Dependencies:
-
-- Phase 4 completion
-
-## Dependencies
-
-- Backend: FastAPI + Pydantic v2 (already in repo)
-- Storage: multi-backend interface with Azure Blob as initial concrete implementation (local filesystem optional for dev/test)
-
-## Success Criteria
-
-- A clear design exists for processors/formatters/registries, aligned to existing snapshot behavior
-- Backward compatibility for snapshot endpoints is preserved
-- The plan includes a minimal initial implementation slice (JSON export) and a growth path (additional formats and storage targets)
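Note on the deleted design details above: the registry and configuration strategy from Tasks 2.1 and 2.2 can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not code from the repository. The class names `ExportProcessorRegistry` and `ExportFormatterRegistry` and the comma-separated shape of `GTC_EXPORT_PROCESSOR_ORDER` come from the plan; the shared `_NamedRegistry` base, `UnknownExportNameError`, `parse_processor_order`, `run_export`, and the `drop_empty` processor are hypothetical names introduced only for illustration.

```python
import json
from typing import Callable, Dict, Generic, List, Optional, TypeVar, Union

Record = dict
Processor = Callable[[List[Record]], List[Record]]
Formatter = Callable[[List[Record]], Union[bytes, str]]
T = TypeVar("T")


class UnknownExportNameError(LookupError):
    """Hypothetical error type; an API layer could translate it into HTTP 400."""


class _NamedRegistry(Generic[T]):
    """Hypothetical shared base: register by name, reject duplicates."""

    kind = "entry"

    def __init__(self) -> None:
        self._entries: Dict[str, T] = {}

    def register(self, name: str, entry: T) -> None:
        if name in self._entries:
            raise ValueError(f"duplicate export {self.kind} name: {name!r}")
        self._entries[name] = entry

    def resolve(self, name: str) -> T:
        if name not in self._entries:
            known = ", ".join(sorted(self._entries)) or "<none>"
            raise UnknownExportNameError(
                f"unknown export {self.kind} {name!r}; registered: {known}"
            )
        return self._entries[name]


class ExportProcessorRegistry(_NamedRegistry[Processor]):
    kind = "processor"


class ExportFormatterRegistry(_NamedRegistry[Formatter]):
    kind = "formatter"


def parse_processor_order(raw: Optional[str]) -> List[str]:
    """Split a comma-separated value, dropping blanks and surrounding whitespace."""
    return [part.strip() for part in (raw or "").split(",") if part.strip()]


def run_export(records, processor_names, format_name, processors, formatters):
    """Apply the processor chain in order, then format the result."""
    for name in processor_names:
        records = processors.resolve(name)(records)
    return formatters.resolve(format_name)(records)


# Usage with a made-up processor and a sorted-key JSON formatter.
processors = ExportProcessorRegistry()
processors.register("drop_empty", lambda recs: [r for r in recs if r])
formatters = ExportFormatterRegistry()
formatters.register("json_items", lambda recs: json.dumps(recs, sort_keys=True))

order = parse_processor_order(" drop_empty , ")
output = run_export([{"id": "a"}, {}], order, "json_items", processors, formatters)
```

Raising a dedicated lookup error keeps the registry free of HTTP concerns; whether the real implementation maps it to a 400 in the router or in the service is left open by the plan.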
diff --git a/.copilot-tracking/details/20260116-export-pipeline-implementation-details.md b/.copilot-tracking/details/20260116-export-pipeline-implementation-details.md
deleted file mode 100644
index 879a42c..0000000
--- a/.copilot-tracking/details/20260116-export-pipeline-implementation-details.md
+++ /dev/null
@@ -1,304 +0,0 @@
----
-description: Implementation details for building the export pipeline into the backend codebase
-ms.date: 2026-01-16
----
-<!-- markdownlint-disable-file -->
-
-# Task Details: Export Pipeline Implementation
-
-## Research Reference
-
-**Source Research**: .copilot-tracking/research/20260116-export-pipeline-implementation-research.md
-
-## Phase 1: Lock down compatibility contract
-
-### Task 1.1: Confirm snapshot endpoint contracts (write + download)
-
-Ensure the implementation preserves the behaviors that are already tested and used by the frontend:
-
-- `POST /v1/ground-truths/snapshot` continues to write per-item JSON artifacts plus `manifest.json` under `exports/snapshots/{ts}/` and returns JSON with `snapshotDir`, `count`, and `manifestPath`.
-- `GET /v1/ground-truths/snapshot` continues to return an `application/json` attachment with `Content-Disposition` containing a filename and stable payload keys.
-
-* **Files**:
-  - backend/app/api/v1/ground_truths.py
-  - backend/app/services/snapshot_service.py
-  - backend/tests/integration/test_snapshot_artifacts_cosmos.py
-  - backend/tests/integration/ground_truths/test_snapshot_download_endpoint.py
-  - frontend/src/services/groundTruths.ts
-
-* **Success**:
-  - Existing snapshot unit + integration tests remain the baseline acceptance gate.
-  - Frontend download behavior remains unchanged (filename derived from header).
-
-* **Research References**:
-  - .copilot-tracking/research/20260116-export-pipeline-implementation-research.md (Lines 27-52) - Verified snapshot behaviors and frontend coupling
-
-* **Dependencies**:
-  - None
-
-### Task 1.2: Define compatibility-safe defaults for pipeline adoption
-
-Decide how the pipeline will be introduced without changing existing behavior:
-
-- Treat an omitted/empty request body for `POST /v1/ground-truths/snapshot` as the legacy behavior (artifact write + manifest).
-- Use the new pipeline request model only when request fields are provided.
-- Keep the `GET /v1/ground-truths/snapshot` behavior stable, but allow its internal implementation to be pipeline-driven.
-
-* **Files**:
-  - docs/computed-tags-design.md (Section 4.4)
-  - backend/app/api/v1/ground_truths.py
-
-* **Success**:
-  - A clear decision is written into the implementation as code-level defaults.
-  - Existing callers do not need to change.
-
-* **Research References**:
-  - .copilot-tracking/research/20260116-export-pipeline-implementation-research.md (Lines 85-108) - Design requirements and compatibility rule
-  - .copilot-tracking/research/20260116-export-pipeline-implementation-research.md (Lines 158-160) - Compatibility traps
-
-* **Dependencies**:
-  - Task 1.1 completion
-
-## Phase 2: Build pipeline core (registries + request models)
-
-### Task 2.1: Add export pipeline request/option models
-
-Implement Pydantic models for the pipeline request body (v1) and internal options, aligned to the design:
-
-- `format` (initial: `json_snapshot_payload`, `json_items`)
-- `filters` (initial: `datasetNames`, `status` with default `approved`)
-- `processors` (optional override list)
-- `delivery.mode` (initial support: `attachment`, `artifact`, `stream`)
-
-* **Files**:
-  - backend/app/exports/models.py (new)
-
-* **Success**:
-  - Request validation errors map to 400 with clear messages for unknown formats/processors.
-  - Defaults preserve legacy snapshot behavior when request is missing.
-
-* **Research References**:
-  - .copilot-tracking/research/20260116-export-pipeline-implementation-research.md (Lines 85-108) - Interfaces, config, and compatibility requirements
-
-* **Dependencies**:
-  - Phase 1 completion
-
-### Task 2.2: Implement processor and formatter registries
-
-Create registries consistent with repo patterns (like computed-tags registry), supporting:
-
-- register-by-name with duplicate rejection
-- resolve-by-name with clear error messages
-- resolve ordered processor chain from:
-  - request override
-  - or `GTC_EXPORT_PROCESSOR_ORDER` default
-
-* **Files**:
-  - backend/app/exports/registry.py (new)
-  - backend/app/core/config.py (add `EXPORT_PROCESSOR_ORDER` setting)
-
-* **Success**:
-  - Registry unit tests cover duplicate registration and missing name resolution.
-  - Environment parsing is deterministic and whitespace-tolerant.
-
-* **Research References**:
-  - .copilot-tracking/research/20260116-export-pipeline-implementation-research.md (Lines 71-83) - Existing registry patterns
-  - .copilot-tracking/research/20260116-export-pipeline-implementation-research.md (Lines 96-98) - `GTC_EXPORT_PROCESSOR_ORDER`
-
-* **Dependencies**:
-  - Task 2.1 completion
-
-## Phase 3: Implement initial processors and formatters
-
-### Task 3.1: Implement processor `merge_tags`
-
-Add a processor that derives a `tags` field as the unique union of `manualTags` and `computedTags`.
-
-- Input/output records must remain JSON-serializable dictionaries.
-- Preserve `manualTags` and `computedTags` as-is.
-
-* **Files**:
-  - backend/app/exports/processors/merge_tags.py (new)
-
-* **Success**:
-  - Unit tests verify order stability (e.g., sorted output) and correct union behavior.
-
-* **Research References**:
-  - .copilot-tracking/research/20260116-export-pipeline-implementation-research.md (Lines 98-101) - Initial pipeline features
-
-* **Dependencies**:
-  - Phase 2 completion
-
-### Task 3.2: Implement formatters `json_items` and `json_snapshot_payload`
-
-Add formatters:
-
-- `json_items`: returns a JSON array of export records
-- `json_snapshot_payload`: returns the stable snapshot payload envelope (`schemaVersion`, `snapshotAt`, `datasetNames`, `count`, `filters`, `items`)
-
-* **Files**:
-  - backend/app/exports/formatters/json_items.py (new)
-  - backend/app/exports/formatters/json_snapshot_payload.py (new)
-  - backend/app/services/snapshot_service.py (delegate payload assembly as needed)
-
-* **Success**:
-  - Formatter outputs match existing snapshot payload expectations.
-  - Tests compare parsed JSON objects (not raw strings) for stability.
-
-* **Research References**:
-  - .copilot-tracking/research/20260116-export-pipeline-implementation-research.md (Lines 31-45) - Current payload keys and tests
-
-* **Dependencies**:
-  - Task 3.1 completion
-
-## Phase 4: Storage backends and delivery modes
-
-### Task 4.1: Define and implement an export storage interface
-
-Create an export storage protocol matching the design:
-
-- `write_json(key: str, obj: dict) -> None`
-- `write_bytes(key: str, data: bytes, content_type: str) -> None`
-- `open_read(key: str)` (for streaming reads)
-- `list_prefix(prefix: str)` (optional, for artifact discovery)
-
-* **Files**:
-  - backend/app/exports/storage/base.py (new)
-  - backend/app/exports/storage/local.py (new)
-
-* **Success**:
-  - Local storage implementation supports the existing snapshot directory layout.
-  - Storage key layout follows `exports/snapshots/{timestamp}/{filename}`.
-
-* **Research References**:
-  - .copilot-tracking/research/20260116-export-pipeline-implementation-research.md (Lines 53-63) - Existing storage abstraction and current bypass
-  - .copilot-tracking/research/20260116-export-pipeline-implementation-research.md (Lines 101-105) - Storage interface and delivery modes
-
-* **Dependencies**:
-  - Phase 3 completion
-
-### Task 4.2: Add Azure Blob storage backend
-
-Implement Blob storage backend using managed identity (`DefaultAzureCredential`) and the Azure Blob SDK.
-
-- Add dependency: `azure-storage-blob`
-- Add explicit settings to `Settings` (due to `extra="forbid"`):
-  - `EXPORT_STORAGE_BACKEND` (`local|blob`)
-  - `EXPORT_BLOB_ACCOUNT_URL`
-  - `EXPORT_BLOB_CONTAINER`
-
-* **Files**:
-  - backend/pyproject.toml
-  - backend/app/core/config.py
-  - backend/app/exports/storage/blob.py (new)
-
-* **Success**:
-  - Blob backend can write and read artifacts using async client (`azure.storage.blob.aio`).
-  - Settings validation fails fast with clear errors when backend is `blob` but configuration is missing.
-
-* **Research References**:
-  - .copilot-tracking/research/20260116-export-pipeline-implementation-research.md (Lines 64-70) - Dependency/config constraints
-  - .copilot-tracking/research/20260116-export-pipeline-implementation-research.md (Lines 142-155) - SDK and local dev operational notes
-
-* **Dependencies**:
-  - Task 4.1 completion
-
-### Task 4.3: Implement delivery modes (attachment/artifact/stream)
-
-Implement delivery behavior in the pipeline service:
-
-- `attachment`: return JSON payload bytes with `Content-Disposition` filename
-- `artifact`: write artifacts + `manifest.json` and return legacy `snapshotDir`/`manifestPath` response
-- `stream`: return a `StreamingResponse` over bytes (for large payloads or Blob reads)
-
-* **Files**:
-  - backend/app/exports/pipeline.py (new)
-  - backend/app/api/v1/ground_truths.py
-
-* **Success**:
-  - `GET /v1/ground-truths/snapshot` preserves current attachment semantics.
-  - `POST /v1/ground-truths/snapshot` preserves current artifact-write behavior by default.
-
-* **Research References**:
-  - .copilot-tracking/research/20260116-export-pipeline-implementation-research.md (Lines 133-140) - FastAPI response patterns
-  - .copilot-tracking/research/20260116-export-pipeline-implementation-research.md (Lines 158-160) - Compatibility traps
-
-* **Dependencies**:
-  - Task 4.2 completion
-
-## Phase 5: Wire into container, API, and tests
-
-### Task 5.1: Wire registries, storage, and pipeline via container
-
-Add pipeline service wiring in the singleton container so routers and services can depend on it.
-
-* **Files**:
-  - backend/app/container.py
-
-* **Success**:
-  - Pipeline dependencies are constructed once per app lifecycle (consistent with other services).
-
-* **Research References**:
-  - .copilot-tracking/research/20260116-export-pipeline-implementation-research.md (Lines 123-129) - Router/service integration expectations
-
-* **Dependencies**:
-  - Phase 4 completion
-
-### Task 5.2: Update snapshot service and routes to use pipeline internally
-
-Implement delegation so:
-
-- `SnapshotService.build_snapshot_payload()` and `export_json()` route through the pipeline logic.
-- API routes remain compatible but gain pipeline support when request parameters are provided.
-
-* **Files**:
-  - backend/app/services/snapshot_service.py
-  - backend/app/api/v1/ground_truths.py
-
-* **Success**:
-  - Existing snapshot tests pass without modification.
-  - Pipeline unit tests pass.
-
-* **Research References**:
-  - .copilot-tracking/research/20260116-export-pipeline-implementation-research.md (Lines 31-45) - Existing endpoint contracts
-
-* **Dependencies**:
-  - Task 5.1 completion
-
-### Task 5.3: Add new unit tests for pipeline components
-
-Add tests for:
-
-- registry behaviors (duplicate and missing names)
-- `merge_tags` processor correctness
-- formatter outputs
-- delivery mode selection (unit-test level)
-
-* **Files**:
-  - backend/tests/unit/test_export_registry.py (new)
-  - backend/tests/unit/test_export_pipeline.py (new)
-
-* **Success**:
-  - New unit tests provide fast validation of pipeline semantics.
-  - Existing integration snapshot tests continue to pass.
-
-* **Research References**:
-  - .copilot-tracking/research/20260116-export-pipeline-implementation-research.md (Lines 162-166) - Suggested verification approach
-
-* **Dependencies**:
-  - Task 5.2 completion
-
-## Dependencies
-
-- Python 3.11 (repo requirement)
-- FastAPI + Starlette responses already in use
-- Azure SDK dependencies:
-  - `azure-identity` (already present)
-  - `azure-storage-blob` (to be added)
-
-## Success Criteria
-
-- Snapshot endpoints remain backward compatible (tests and frontend behavior)
-- Pipeline components exist (registries, processor, formatter, delivery)
-- Local storage works as today; Blob storage works behind a feature flag
-- Tests cover pipeline logic and existing snapshot tests continue passing
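Note on the deleted implementation details above: the `merge_tags` processor (Task 3.1) and the two JSON formatters (Task 3.2) could look roughly like the sketch below. It is illustrative only and assumes details the plan leaves open: sorted output is used for order stability (the plan offers it as an example), the envelope metadata is passed in by the caller rather than computed, and the function names `format_json_items` and `format_json_snapshot_payload` as well as the sample values (`"v-example"`, `"demo"`, the tag strings) are hypothetical.

```python
import json
from typing import Any, Dict, List

Record = Dict[str, Any]


def merge_tags(records: List[Record]) -> List[Record]:
    """Add `tags` as the unique union of `manualTags` and `computedTags`.

    The two source fields are left untouched; each record is shallow-copied
    so the caller's input list is not mutated.
    """
    merged = []
    for record in records:
        out = dict(record)
        manual = record.get("manualTags") or []
        computed = record.get("computedTags") or []
        out["tags"] = sorted(set(manual) | set(computed))  # sorted for determinism
        merged.append(out)
    return merged


def format_json_items(records: List[Record]) -> str:
    """Return a JSON array of export records."""
    return json.dumps(records, sort_keys=True)


def format_json_snapshot_payload(
    records: List[Record],
    *,
    schema_version: str,
    snapshot_at: str,
    dataset_names: List[str],
    filters: Dict[str, Any],
) -> str:
    """Wrap records in the envelope keys listed in Task 3.2."""
    payload = {
        "schemaVersion": schema_version,
        "snapshotAt": snapshot_at,
        "datasetNames": list(dataset_names),
        "count": len(records),
        "filters": filters,
        "items": records,
    }
    return json.dumps(payload, sort_keys=True)


# Usage with made-up sample data.
records = merge_tags(
    [
        {
            "id": "gt-1",
            "manualTags": ["topic:billing"],
            "computedTags": ["length:short", "topic:billing"],
        }
    ]
)
payload = json.loads(
    format_json_snapshot_payload(
        records,
        schema_version="v-example",
        snapshot_at="2026-01-16T00:00:00Z",
        dataset_names=["demo"],
        filters={"status": "approved"},
    )
)
```

Parsing the formatter output back with `json.loads` before asserting mirrors the plan's guidance to compare parsed JSON objects rather than raw strings.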
diff --git a/.copilot-tracking/details/20260116-manual-tags-design-details.md b/.copilot-tracking/details/20260116-manual-tags-design-details.md
deleted file mode 100644
index 2246817..0000000
--- a/.copilot-tracking/details/20260116-manual-tags-design-details.md
+++ /dev/null
@@ -1,196 +0,0 @@
----
-title: Manual Tags Design Details
-description: Detailed specifications and execution notes for manual tags design work
-ms.date: 2026-01-16
----
-<!-- markdownlint-disable-file -->
-
-# Task Details: Manual Tags Design
-
-## Research Reference
-
-* Source research: `.copilot-tracking/research/20260116-manual-tags-design-research.md`
-* Design context: `docs/computed-tags-design.md`
-* Existing tagging constraints: `backend/docs/tagging_plan.md`
-
-## Phase 1: Confirm requirements and align policy
-
-### Task 1.1: Decide manual tag validation mode(s) (MVP)
-
-MVP requirement (confirmed): enforce **mutual exclusivity within a tag group**, and make that enforcement **configurable (true/false)**.
-
-Define the desired validation policy for manual tags across these write paths:
-
-* Interactive edits (`PUT /v1/ground-truths/...`)
-* Assignment updates (`PUT /v1/assignments/...`)
-* Bulk import validation
-
-Current state is already exclusivity-aware for **known** groups via `TAG_SCHEMA` + `ExclusiveGroupRule`, but enforcement is not currently configurable.
-
-* Model validation is relaxed (unknown groups/values allowed).
-* Bulk import uses a strict allow-set from the global tag registry.
-
-Specify the exclusivity enforcement semantics:
-
-* Scope of the toggle:
-  * Global toggle (recommended MVP): enable/disable enforcement of `exclusive=True` groups
-  * Optional later: per-group overrides (in `TAG_SCHEMA` or config)
-* Default:
-  * Recommended: `true` (keep current correctness; allow disabling in dev/experiments)
-* Contract with the frontend:
-  * Decide whether `/v1/tags/schema` should continue to report per-group `exclusive` even when enforcement is disabled server-side.
-    * Recommended: keep reporting `exclusive` so the UI can still guide the user, while backend can be relaxed if needed.
-
-* Files:
-  * `backend/app/domain/validators.py` (model-level coercion + validation)
-  * `backend/app/services/validation_service.py` (bulk import strict validation)
-  * `backend/app/api/v1/ground_truths.py` and `backend/app/api/v1/assignments.py` (write paths)
-* Success:
-  * A single documented policy for exclusivity enforcement (enabled/disabled)
-  * A clear statement of whether the backend or frontend is authoritative when the toggle is off
-* Research references:
-  * `.copilot-tracking/research/20260116-manual-tags-design-research.md` (Current behavior summary + decision points)
-
-### Task 1.2: Decide the source of truth for “allowed manual tags” (optional / follow-up)
-
-This is useful, but not required for the exclusivity MVP. Capture the intended direction so future work is scoped.
-
-Choose the authoritative source for the tag picker (manual tags list):
-
-* Static config (`GTC_ALLOWED_MANUAL_TAGS`)
-* Global registry (Cosmos-backed list)
-* A combined approach (seed from schema, allow extensions via registry)
-
-Also decide whether the source must vary by dataset/tenant in the future.
-
-* Files:
-  * `backend/app/api/v1/tags.py` (current: allowlist overrides registry)
-  * `backend/app/core/config.py` (existing settings surface)
-  * `backend/app/adapters/repos/tags_repo.py` (Cosmos shape for global registry)
-* Success:
-  * A single selection mechanism that the backend implements and the frontend can rely on
-  * Clear behavior when `GTC_ALLOWED_MANUAL_TAGS` is set
-
-## Phase 2: Implement configurable exclusivity (backend)
-
-### Task 2.1: Add a config flag to enable/disable exclusivity enforcement
-
-Add a settings flag (e.g., `GTC_TAGS_ENFORCE_EXCLUSIVITY: bool`) that controls whether the backend enforces mutual exclusivity for groups marked `exclusive=True`.
-
-Implementation approach options:
-
-* Toggle rule execution:
-  * Build `RULES` dynamically at runtime based on settings
-  * Or: keep `RULES` constant but gate `ExclusiveGroupRule.check()` behind the flag
-
-Ensure the flag is applied consistently anywhere `validate_tags()` is used for manual tags.
-
-* Files:
-  * `backend/app/core/config.py` (new setting)
-  * `backend/app/domain/tags.py` (rule wiring / schema)
-  * `backend/app/services/tagging_service.py` (validation flow)
-  * `backend/app/domain/validators.py` (model validation uses `validate_tags()`)
-  * `backend/app/api/v1/ground_truths.py` and `backend/app/api/v1/assignments.py` (write path validation behavior)
-* Success:
-  * When enabled: multiple tags from an exclusive group are rejected everywhere manual tags are accepted
-  * When disabled: multiple tags from an exclusive group are accepted (still requiring canonical `group:value` format)
-
-### Task 2.2: Document and expose the exclusivity flag (if needed)
-
-Decide whether the flag should be:
-
-* Backend-only (env setting), or
-* Also exposed to the frontend via `/v1/config` so the UI can mirror server policy.
-
-* Files:
-  * `backend/app/core/config.py`
-  * `backend/app/api/v1/config.py` (if exposing to frontend)
-* Success:
-  * The configured behavior is discoverable and documented
-
-## Phase 3: Validation and normalization improvements
-
-### Task 3.1: Ensure exclusivity toggle does not affect computed-tags invariants
-
-Ensure the exclusivity flag only controls exclusivity checks, and does not regress:
-
-* Canonicalization (`group:value`, lowercase, etc.)
-* Computed-tags stripping from manual tags (write path)
-
-* Files:
-  * `backend/app/services/tagging_service.py`
-  * `backend/app/services/validation_service.py` (bulk import)
-* Success:
-  * Exclusivity can be toggled independently without breaking other tag guarantees
-
-### Task 3.2: Confirm computed tag stripping remains authoritative
-
-Ensure that manual tags cannot be used to persist computed tags:
-
-* Manual tags are cleaned via computed-tag registry matching before save
-* Reject client writes to `computedTags`
-
-This should remain true for all write paths.
-
-* Files:
-  * `backend/app/services/tagging_service.py`
-  * `backend/app/api/v1/ground_truths.py`
-  * `backend/app/api/v1/assignments.py`
-* Success:
-  * Tests cover attempts to submit computed tags in `manualTags`
-
-## Phase 4: API contracts and frontend expectations
-
-### Task 4.1: Confirm `/v1/tags/schema` and frontend behavior under the toggle
-
-When exclusivity enforcement is disabled server-side, decide whether the frontend should:
-
-* Still enforce exclusivity based on `/v1/tags/schema` (recommended default), or
-* Also disable client-side exclusivity when the backend flag is off (requires exposing flag to frontend)
-
-* Files:
-  * `backend/app/api/v1/tags.py` (schema endpoint)
-  * `frontend/src/services/tags.ts` (exclusive-group validation)
-  * `backend/app/api/v1/config.py` (if exposing flag)
-* Success:
-  * Frontend and backend behave consistently (or diverge intentionally, with the divergence documented)
-
-### Task 4.2: Confirm `/v1/tags` response contract (optional / follow-up)
-
-This is not required for the exclusivity MVP, but keep this as a follow-up if API payload cleanup is desired.
-
-* Files:
-  * `backend/app/api/v1/tags.py`
-  * `backend/app/domain/tags.py`
-  * `frontend/src/services/tags.ts`
-* Success:
-  * No frontend breaking changes when new tags are introduced
-
-## Phase 5: Verification and documentation
-
-### Task 5.1: Add/adjust tests for exclusivity toggle
-
-Cover the chosen policy with tests:
-
-* Unit tests for exclusivity enabled vs disabled
-* Integration tests for write paths (ground truths + assignments) when exclusivity is disabled
-* Frontend test/behavior note: ensure UX does not unexpectedly diverge from backend
-
-* Files:
-  * `backend/tests/unit/`
-  * `backend/tests/integration/`
-* Success:
-  * Tests clearly assert which flows allow unknown tags vs require registry membership
-
-### Task 5.2: Document configuration and operations
-
-Add docs describing:
-
-* How to configure allowed manual tags
-* How tag registry should be managed in dev/test/prod
-* How strict validation affects bulk import and interactive edits
-
-* Files:
-  * `backend/README.md` (or a new doc under `backend/docs/`)
-* Success:
-  * A new team member can configure tags without reading code
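The exclusivity toggle described in Tasks 4.1 and 5.1 of the deleted details file can be sketched as below. This is only a minimal illustration under stated assumptions: tags are assumed to be `group:value` strings, and the function name `find_exclusivity_violations`, its `enforce` parameter, and the shape of its return value are hypothetical, not taken from the repository.

```python
from collections import defaultdict


def find_exclusivity_violations(tags, exclusive_groups, enforce=True):
    """Return {group: [values]} for exclusive groups holding more than one value.

    Hypothetical helper: assumes each tag looks like "group:value".
    When enforcement is switched off, no violations are ever reported.
    """
    if not enforce:
        return {}

    values_by_group = defaultdict(list)
    for tag in tags:
        group, sep, value = tag.partition(":")
        # Only grouped tags that belong to an exclusive group are tracked;
        # repeated identical tags do not count as a conflict.
        if sep and group in exclusive_groups and value not in values_by_group[group]:
            values_by_group[group].append(value)

    return {g: vals for g, vals in values_by_group.items() if len(vals) > 1}
```

A unit test for "enabled vs disabled" would then call the same helper twice with identical tags and assert that only the enforcing call reports a conflict.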
diff --git a/.copilot-tracking/plans/20260116-export-pipeline-design-plan.instructions.md b/.copilot-tracking/plans/20260116-export-pipeline-design-plan.instructions.md
deleted file mode 100644
index 80c8c89..0000000
--- a/.copilot-tracking/plans/20260116-export-pipeline-design-plan.instructions.md
+++ /dev/null
@@ -1,97 +0,0 @@
----
-applyTo: '.copilot-tracking/changes/20260116-export-pipeline-design-changes.md'
-description: Task checklist for designing an export pipeline (processors/formatters) for Ground Truth Curator
-ms.date: 2026-01-16
----
-<!-- markdownlint-disable-file -->
-
-# Task Checklist: Export Pipeline Design
-
-## Overview
-
-Design a pluggable export pipeline (processors + formatters) that preserves current snapshot export behaviors while enabling additional formats and a multi-backend storage interface (Azure Blob as the initial concrete backend).
-
-Follow the repository workflow guidance in `AGENTS.md` (Jujutsu commit workflow) and keep a running record of work in `.copilot-tracking/changes/20260116-export-pipeline-design-changes.md` during implementation.
-
-## Objectives
-
-- Preserve the existing snapshot write and download behaviors while introducing an extensible pipeline for export transforms and formats
-- Define processor/formatter/registry abstractions, configuration, and a minimal initial export slice (JSON)
-- Define an export storage interface that supports multiple backends, with Azure Blob Storage as the first implementation target
-
-## Research Summary
-
-### Project files
-
-- `backend/app/services/snapshot_service.py` - current snapshot artifact writer and in-memory snapshot payload builder
-- `backend/app/api/v1/ground_truths.py` - snapshot routes (write + downloadable attachment)
-- `backend/app/adapters/storage/base.py` and `backend/app/adapters/storage/local_fs.py` - existing (currently underused) storage abstraction
-- `frontend/src/services/groundTruths.ts` - frontend download expectations for `Content-Disposition`
-- `docs/computed-tags-design.md` - proposes processor/formatter export pipeline architecture
-- `docs/json-export-migration-plan.md` - documents JSON (not JSONL) export expectations
-
-### External references
-
-- `.copilot-tracking/research/20260116-export-pipeline-design-research.md` - verified repo findings and proposed pipeline shape
-- FastAPI custom responses (StreamingResponse/FileResponse): https://fastapi.tiangolo.com/advanced/custom-response/
-- Azure Storage Blobs client library for Python (auth patterns, async clients): https://learn.microsoft.com/en-us/python/api/overview/azure/storage-blob-readme
-- Azure Blob Storage Python quickstart (managed identity / DefaultAzureCredential): https://learn.microsoft.com/en-us/azure/storage/blobs/storage-quickstart-blobs-python
-
-### Standards references
-
-- `AGENTS.md` - repo workflow and expectations
-- `backend/CODEBASE.md` - backend layering and conventions
-
-## Implementation Checklist
-
-### [ ] Phase 1: Requirements and compatibility
-
-- [ ] Task 1.1: Document the current export behavior baseline
-  - Details: `.copilot-tracking/details/20260116-export-pipeline-design-details.md` (Lines 15-40)
-
-- [ ] Task 1.2: Decide the v1 export pipeline API surface
-  - Details: `.copilot-tracking/details/20260116-export-pipeline-design-details.md` (Lines 42-73)
-
-### [ ] Phase 2: Pipeline abstractions
-
-- [ ] Task 2.1: Specify processor and formatter interfaces
-  - Details: `.copilot-tracking/details/20260116-export-pipeline-design-details.md` (Lines 77-108)
-
-- [ ] Task 2.2: Specify registries and configuration strategy
-  - Details: `.copilot-tracking/details/20260116-export-pipeline-design-details.md` (Lines 110-141)
-
-### [ ] Phase 3: Execution flow
-
-- [ ] Task 3.1: Specify export execution orchestration
-  - Details: `.copilot-tracking/details/20260116-export-pipeline-design-details.md` (Lines 145-174)
-
-- [ ] Task 3.2: Define initial processors and formatters
-  - Details: `.copilot-tracking/details/20260116-export-pipeline-design-details.md` (Lines 176-206)
-
-### [ ] Phase 4: Storage targets (multi-backend)
-
-- [ ] Task 4.1: Define a multi-backend export storage interface
-  - Details: `.copilot-tracking/details/20260116-export-pipeline-design-details.md` (Lines 210-247)
-
-- [ ] Task 4.2: Specify Azure Blob configuration and authentication strategy
-  - Details: `.copilot-tracking/details/20260116-export-pipeline-design-details.md` (Lines 249-280)
-
-- [ ] Task 4.3: Define delivery strategy for Blob-hosted artifacts
-  - Details: `.copilot-tracking/details/20260116-export-pipeline-design-details.md` (Lines 282-311)
-
-### [ ] Phase 5: Tests and rollout
-
-- [ ] Task 5.1: Add test strategy for pipeline configuration and outputs
-  - Details: `.copilot-tracking/details/20260116-export-pipeline-design-details.md` (Lines 315-342)
-
-## Dependencies
-
-- Python 3.11 + FastAPI + Pydantic v2 (already present)
-- Existing GroundTruthRepo data access patterns and snapshot tests
-- Azure Blob dependencies and configuration when implementing the Blob backend (e.g., `azure-storage-blob` + `azure-identity`)
-
-## Success Criteria
-
-- The export pipeline design is documented with clear interfaces, configuration, and a minimal initial format set (JSON)
-- Snapshot endpoints remain backward compatible
-- The design includes a clear path to add processors, new formats, and new storage targets without rewriting the core flow
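The multi-backend storage interface named in Phase 4 of the deleted plan could look roughly like the sketch below. All names here (`ExportStorage`, `write_bytes`, `LocalFsExportStorage`, `InMemoryExportStorage`, `make_storage`) are assumptions made for illustration; the plan itself does not specify them, and an Azure Blob backend would be a third class behind the same protocol.

```python
from pathlib import Path
from typing import Optional, Protocol


class ExportStorage(Protocol):
    """Hypothetical storage contract: persist bytes under a key, return a locator."""

    def write_bytes(self, key: str, data: bytes) -> str: ...


class LocalFsExportStorage:
    """Writes artifacts beneath a base directory on the local filesystem."""

    def __init__(self, base_dir: Path) -> None:
        self.base_dir = Path(base_dir)

    def write_bytes(self, key: str, data: bytes) -> str:
        target = self.base_dir / key
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(data)
        return str(target)


class InMemoryExportStorage:
    """Keeps artifacts in a dict; convenient for unit tests."""

    def __init__(self) -> None:
        self.blobs: dict = {}

    def write_bytes(self, key: str, data: bytes) -> str:
        self.blobs[key] = data
        return f"memory://{key}"


def make_storage(backend: str, base_dir: Optional[Path] = None) -> ExportStorage:
    """Select a backend by name; unknown names fail loudly."""
    if backend == "memory":
        return InMemoryExportStorage()
    if backend == "local":
        if base_dir is None:
            raise ValueError("the local backend needs a base_dir")
        return LocalFsExportStorage(base_dir)
    raise ValueError(f"unknown export storage backend: {backend!r}")
```

Keeping the selector in one factory means the default local behavior stays untouched until a different backend name is configured explicitly.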
diff --git a/.copilot-tracking/plans/20260116-export-pipeline-implementation-plan.instructions.md b/.copilot-tracking/plans/20260116-export-pipeline-implementation-plan.instructions.md
deleted file mode 100644
index 7c9b7ce..0000000
--- a/.copilot-tracking/plans/20260116-export-pipeline-implementation-plan.instructions.md
+++ /dev/null
@@ -1,94 +0,0 @@
----
-applyTo: '.copilot-tracking/changes/20260116-export-pipeline-implementation-changes.md'
----
-<!-- markdownlint-disable-file -->
-# Task Checklist: Export Pipeline Implementation
-
-## Overview
-
-Implement the export pipeline architecture from the designs and wire it into the existing snapshot endpoints without breaking backward compatibility.
-
-Follow repository workflow guidance from #file:../../AGENTS.md
-
-## Objectives
-
-* Preserve existing snapshot endpoint behavior (routes, payload keys, and `Content-Disposition` semantics).
-* Introduce export pipeline components (models, registries, processors, formatters, delivery modes, and storage backends) with unit test coverage.
-* Add Blob storage support behind explicit settings and a backend selector, without impacting local defaults.
-
-## Research Summary
-
-### Project Files
-
-* .copilot-tracking/research/20260116-export-pipeline-implementation-research.md - Verified current snapshot contracts, coupling to frontend download behavior, and concrete code touchpoints.
-* backend/app/api/v1/ground_truths.py - Snapshot endpoints that must remain backward compatible.
-* backend/app/services/snapshot_service.py - Current snapshot artifact write and payload build behavior.
-* backend/app/core/config.py - Settings strictness (`extra="forbid"`) that requires explicit new env vars.
-* frontend/src/services/groundTruths.ts - Parses `Content-Disposition` to derive download filename.
-* docs/computed-tags-design.md - Export pipeline architecture requirements.
-
-### External References
-
-* .copilot-tracking/research/20260116-export-pipeline-implementation-research.md - Captures SDK usage patterns and response behavior references.
-
-## Implementation Checklist
-
-### [x] Phase 1: Lock down compatibility contract
-
-* [x] Task 1.1: Confirm snapshot endpoint contracts (write + download)
-  * Details: .copilot-tracking/details/20260116-export-pipeline-implementation-details.md (Lines 15-37)
-
-* [x] Task 1.2: Define compatibility-safe defaults for pipeline adoption
-  * Details: .copilot-tracking/details/20260116-export-pipeline-implementation-details.md (Lines 39-60)
-
-### [x] Phase 2: Build pipeline core (registries + request models)
-
-* [x] Task 2.1: Add export pipeline request/option models
-  * Details: .copilot-tracking/details/20260116-export-pipeline-implementation-details.md (Lines 64-84)
-
-* [x] Task 2.2: Implement processor and formatter registries
-  * Details: .copilot-tracking/details/20260116-export-pipeline-implementation-details.md (Lines 86-109)
-
-### [x] Phase 3: Implement initial processors and formatters
-
-* [x] Task 3.1: Implement processor `merge_tags`
-  * Details: .copilot-tracking/details/20260116-export-pipeline-implementation-details.md (Lines 113-130)
-
-* [x] Task 3.2: Implement formatters `json_items` and `json_snapshot_payload`
-  * Details: .copilot-tracking/details/20260116-export-pipeline-implementation-details.md (Lines 132-152)
-
-### [x] Phase 4: Storage backends and delivery modes
-
-* [x] Task 4.1: Define and implement an export storage interface
-  * Details: .copilot-tracking/details/20260116-export-pipeline-implementation-details.md (Lines 156-178)
-
-* [x] Task 4.2: Add Azure Blob storage backend
-  * Details: .copilot-tracking/details/20260116-export-pipeline-implementation-details.md (Lines 180-204)
-
-* [x] Task 4.3: Implement delivery modes (attachment/artifact/stream)
-  * Details: .copilot-tracking/details/20260116-export-pipeline-implementation-details.md (Lines 206-227)
-
-### [x] Phase 5: Wire into container, API, and tests
-
-* [x] Task 5.1: Wire registries, storage, and pipeline via container
-  * Details: .copilot-tracking/details/20260116-export-pipeline-implementation-details.md (Lines 231-245)
-
-* [x] Task 5.2: Update snapshot service and routes to use pipeline internally
-  * Details: .copilot-tracking/details/20260116-export-pipeline-implementation-details.md (Lines 247-266)
-
-* [x] Task 5.3: Add new unit tests for pipeline components
-  * Details: .copilot-tracking/details/20260116-export-pipeline-implementation-details.md (Lines 268-289)
-
-## Dependencies
-
-* Backend Python environment (per backend/pyproject.toml)
-* Azure SDK packages:
-  * `azure-identity` (already present)
-  * `azure-storage-blob` (to add for Blob backend)
-* Local dev: Cosmos Emulator and existing integration test environment (as already used by repo)
-
-## Success Criteria
-
-* Existing snapshot unit/integration tests pass unchanged.
-* Export pipeline core exists and is covered by new unit tests.
-* Blob backend is selectable via explicit settings and does not impact default local behavior.
diff --git a/.copilot-tracking/plans/20260116-manual-tags-design-plan.instructions.md b/.copilot-tracking/plans/20260116-manual-tags-design-plan.instructions.md
deleted file mode 100644
index b0b1069..0000000
--- a/.copilot-tracking/plans/20260116-manual-tags-design-plan.instructions.md
+++ /dev/null
@@ -1,91 +0,0 @@
----
-applyTo: '.copilot-tracking/changes/20260116-manual-tags-design-changes.md'
----
-<!-- markdownlint-disable-file -->
-# Task Checklist: Manual Tags Design
-
-## Overview
-
-Define and implement a consistent, testable manual-tags design focused on configurable mutual exclusivity within tag groups, while keeping API contracts stable for the frontend.
-
-Follow all instructions from #file:../../.github/instructions/task-implementation.instructions.md
-If that file is not present in this repository, follow `AGENTS.md` for the repo workflow and the workspace-wide instructions configured in VS Code.
-
-## Objectives
-
-* Define the manual tag validation policy for interactive writes vs bulk import
-* Implement a backend configuration flag to enable/disable exclusivity enforcement for `exclusive=True` groups
-* Ensure manual tags cannot collide with computed tags and that API contracts remain stable for the frontend
-
-## Research Summary
-
-### Project Files
-
-* `.copilot-tracking/research/20260116-manual-tags-design-research.md` - Verified current behavior, key files, and decision points
-* `backend/app/api/v1/tags.py` - `/v1/tags` and `/v1/tags/schema` behavior and allowlist override
-* `backend/app/services/tagging_service.py` - Tag normalization and validation helpers
-* `backend/app/services/validation_service.py` - Bulk import strict validation with a registry-derived allow-set
-* `frontend/src/services/tags.ts` - Frontend expectations for schema and tag list payloads
-* `docs/computed-tags-design.md` - Overall tags split model and computed tag constraints
-* `backend/docs/tagging_plan.md` - Original intent and constraints for tag schema/rules
-
-### External References
-
-* `.copilot-tracking/research/20260116-manual-tags-design-research.md` (Lines 160-169) - Reference links
-* <https://docs.pydantic.dev/latest/concepts/validators/> - Pydantic v2 field validator patterns
-* <https://fastapi.tiangolo.com/tutorial/response-model/> - FastAPI response model behavior
-* <https://learn.microsoft.com/azure/cosmos-db/partitioning-overview> - Cosmos DB partitioning constraints relevant to the global tags container
-
-## Implementation Checklist
-
-### [ ] Phase 1: Confirm requirements and align policy
-
-* [ ] Task 1.1: Decide manual tag validation mode(s) (MVP)
-  * Details: `.copilot-tracking/details/20260116-manual-tags-design-details.md` (Lines 18-53)
-
-* [ ] Task 1.2: Decide the source of truth for “allowed manual tags” (optional / follow-up)
-  * Details: `.copilot-tracking/details/20260116-manual-tags-design-details.md` (Lines 54-73)
-
-### [ ] Phase 2: Implement configurable exclusivity (backend)
-
-* [ ] Task 2.1: Add a config flag to enable/disable exclusivity enforcement
-  * Details: `.copilot-tracking/details/20260116-manual-tags-design-details.md` (Lines 76-97)
-
-* [ ] Task 2.2: Document and expose the exclusivity flag (if needed)
-  * Details: `.copilot-tracking/details/20260116-manual-tags-design-details.md` (Lines 98-110)
-
-### [ ] Phase 3: Validation and normalization improvements
-
-* [ ] Task 3.1: Ensure exclusivity toggle does not affect computed-tags invariants
-  * Details: `.copilot-tracking/details/20260116-manual-tags-design-details.md` (Lines 113-125)
-
-* [ ] Task 3.2: Confirm computed tag stripping remains authoritative
-  * Details: `.copilot-tracking/details/20260116-manual-tags-design-details.md` (Lines 126-141)
-
-### [ ] Phase 4: API contracts and frontend expectations
-
-* [ ] Task 4.1: Confirm `/v1/tags/schema` and frontend behavior under the toggle
-  * Details: `.copilot-tracking/details/20260116-manual-tags-design-details.md` (Lines 144-157)
-
-* [ ] Task 4.2: Confirm `/v1/tags` response contract (optional / follow-up)
-  * Details: `.copilot-tracking/details/20260116-manual-tags-design-details.md` (Lines 158-168)
-
-### [ ] Phase 5: Verification and documentation
-
-* [ ] Task 5.1: Add/adjust tests for exclusivity toggle
-  * Details: `.copilot-tracking/details/20260116-manual-tags-design-details.md` (Lines 171-184)
-
-* [ ] Task 5.2: Document configuration and operations
-  * Details: `.copilot-tracking/details/20260116-manual-tags-design-details.md` (Lines 185-196)
-
-## Dependencies
-
-* Python 3.11, `uv`, FastAPI, Pydantic v2
-* Azure Cosmos DB (or emulator) when validating tags registry persistence
-* Frontend build toolchain (Vite) to verify tag-picker behavior end-to-end
-
-## Success Criteria
-
-* Manual tag policy is explicitly defined and implemented consistently across all write paths
-* The `/v1/tags` and `/v1/tags/schema` contracts remain stable and match frontend expectations
-* Provider selection is test-covered and computed-tag collisions are prevented at startup and on write
diff --git a/.copilot-tracking/prompts/implement-export-pipeline-design.prompt.md b/.copilot-tracking/prompts/implement-export-pipeline-design.prompt.md
deleted file mode 100644
index cb7bb1b..0000000
--- a/.copilot-tracking/prompts/implement-export-pipeline-design.prompt.md
+++ /dev/null
@@ -1,47 +0,0 @@
----
-description: Implementation prompt for executing the export pipeline design plan
-ms.date: 2026-01-16
----
-<!-- markdownlint-disable-file -->
-
-# Implementation Prompt: Export Pipeline Design
-
-## Implementation Instructions
-
-### Step 1: Create changes tracking file
-
-You WILL create `20260116-export-pipeline-design-changes.md` in `.copilot-tracking/changes/` if it does not exist.
-
-### Step 2: Execute implementation
-
-You WILL follow the repository workflow guidance in `AGENTS.md` (Jujutsu commit workflow).
-
-You WILL systematically implement `../plans/20260116-export-pipeline-design-plan.instructions.md` task-by-task.
-
-CRITICAL: If ${input:phaseStop:true} is true, you WILL stop after each Phase for user review.
-
-CRITICAL: If ${input:taskStop:false} is true, you WILL stop after each Task for user review.
-
-### Step 3: Cleanup
-
-When ALL Phases are checked off (`[x]`) and completed you WILL do the following:
-
-1. You WILL provide a markdown style link and a summary of all changes from #file:../changes/20260116-export-pipeline-design-changes.md to the user:
-   - You WILL keep the overall summary brief
-   - You WILL add spacing around any lists
-   - You MUST wrap any reference to a file in a markdown style link
-
-2. You WILL provide markdown style links to:
-   - `.copilot-tracking/plans/20260116-export-pipeline-design-plan.instructions.md`
-   - `.copilot-tracking/details/20260116-export-pipeline-design-details.md`
-   - `.copilot-tracking/research/20260116-export-pipeline-design-research.md`
-
-3. MANDATORY: You WILL attempt to delete `.copilot-tracking/prompts/implement-export-pipeline-design.prompt.md`
-
-## Success Criteria
-
-- [ ] Changes tracking file created
-- [ ] All plan items implemented with working code
-- [ ] All detailed specifications satisfied
-- [ ] Snapshot endpoints remain backward compatible
-- [ ] Changes file updated continuously
diff --git a/.copilot-tracking/prompts/implement-export-pipeline-implementation.prompt.md b/.copilot-tracking/prompts/implement-export-pipeline-implementation.prompt.md
deleted file mode 100644
index d6bc129..0000000
--- a/.copilot-tracking/prompts/implement-export-pipeline-implementation.prompt.md
+++ /dev/null
@@ -1,38 +0,0 @@
-<!-- markdownlint-disable-file -->
-# Implementation Prompt: Export Pipeline Implementation
-
-## Implementation Instructions
-
-### Step 1: Create Changes Tracking File
-
-You WILL create `20260116-export-pipeline-implementation-changes.md` in `.copilot-tracking/changes/` if it does not exist.
-
-### Step 2: Execute Implementation
-
-You WILL follow repository workflow guidance from #file:../../AGENTS.md
-You WILL systematically implement #file:../plans/20260116-export-pipeline-implementation-plan.instructions.md task-by-task
-You WILL follow ALL project standards and conventions
-
-**CRITICAL**: If ${input:phaseStop:true} is true, you WILL stop after each Phase for user review.
-**CRITICAL**: If ${input:taskStop:false} is true, you WILL stop after each Task for user review.
-
-### Step 3: Cleanup
-
-When ALL Phases are checked off (`[x]`) and completed you WILL do the following:
-
-1. You WILL provide a markdown style link and a summary of all changes from #file:../changes/20260116-export-pipeline-implementation-changes.md to the user:
-   * You WILL keep the overall summary brief
-   * You WILL add spacing around any lists
-   * You MUST wrap any reference to a file in a markdown style link
-
-2. You WILL provide markdown style links to .copilot-tracking/plans/20260116-export-pipeline-implementation-plan.instructions.md, .copilot-tracking/details/20260116-export-pipeline-implementation-details.md, and .copilot-tracking/research/20260116-export-pipeline-implementation-research.md documents.
-
-3. **MANDATORY**: You WILL attempt to delete .copilot-tracking/prompts/implement-export-pipeline-implementation.prompt.md
-
-## Success Criteria
-
-* [ ] Changes tracking file created
-* [ ] All plan items implemented with working code
-* [ ] All detailed specifications satisfied
-* [ ] Project conventions followed
-* [ ] Changes file updated continuously
diff --git a/.copilot-tracking/prompts/implement-manual-tags-design.prompt.md b/.copilot-tracking/prompts/implement-manual-tags-design.prompt.md
deleted file mode 100644
index b6e2902..0000000
--- a/.copilot-tracking/prompts/implement-manual-tags-design.prompt.md
+++ /dev/null
@@ -1,44 +0,0 @@
----
-title: Implementation Prompt - Manual Tags Design
-description: Execution prompt for implementing the manual tags design plan
-ms.date: 2026-01-16
----
-<!-- markdownlint-disable-file -->
-# Implementation Prompt: Manual Tags Design
-
-## Implementation Instructions
-
-### Step 1: Create Changes Tracking File
-
-You WILL create `20260116-manual-tags-design-changes.md` in `.copilot-tracking/changes/` if it does not exist.
-
-### Step 2: Execute Implementation
-
-You WILL follow #file:../../.github/instructions/task-implementation.instructions.md
-If that file is not present in this repository, you WILL follow `AGENTS.md` for the repo workflow and the workspace-wide instructions configured in VS Code.
-You WILL systematically implement #file:../plans/20260116-manual-tags-design-plan.instructions.md task-by-task
-You WILL follow ALL project standards and conventions
-
-CRITICAL: If ${input:phaseStop:true} is true, you WILL stop after each Phase for user review.
-CRITICAL: If ${input:taskStop:false} is true, you WILL stop after each Task for user review.
-
-### Step 3: Cleanup
-
-When ALL Phases are checked off (`[x]`) and completed you WILL do the following:
-
-1. You WILL provide a markdown style link and a summary of all changes from #file:../changes/20260116-manual-tags-design-changes.md to the user:
-   * You WILL keep the overall summary brief
-   * You WILL add spacing around any lists
-   * You MUST wrap any reference to a file in a markdown style link
-
-2. You WILL provide markdown style links to `.copilot-tracking/plans/20260116-manual-tags-design-plan.instructions.md`, `.copilot-tracking/details/20260116-manual-tags-design-details.md`, and `.copilot-tracking/research/20260116-manual-tags-design-research.md` documents. You WILL recommend cleaning these files up as well.
-
-3. MANDATORY: You WILL attempt to delete `.copilot-tracking/prompts/implement-manual-tags-design.prompt.md`
-
-## Success Criteria
-
-* [ ] Changes tracking file created
-* [ ] All plan items implemented with working code
-* [ ] All detailed specifications satisfied
-* [ ] Project conventions followed
-* [ ] Changes file updated continuously
diff --git a/.copilot-tracking/research/20260116-export-pipeline-design-research.md b/.copilot-tracking/research/20260116-export-pipeline-design-research.md
deleted file mode 100644
index 02e8339..0000000
--- a/.copilot-tracking/research/20260116-export-pipeline-design-research.md
+++ /dev/null
@@ -1,221 +0,0 @@
----
-description: Research findings to support an export pipeline design plan for Ground Truth Curator
-ms.date: 2026-01-16
----
-<!-- markdownlint-disable-file -->
-
-# Research: Export Pipeline Design
-
-## Tooling notes (how findings were verified)
-
-- Workspace search: `file_search` and `grep_search` were used to locate existing snapshot/export routes, services, and tests.
-- File inspection: `read_file` was used to review the current implementations and confirm actual behavior.
-- External references: `fetch_webpage` was used to pull verified FastAPI documentation for streaming/file download responses.
-
-## Scope
-
-Define an export pipeline architecture that supports:
-
-- The existing snapshot export behaviors (write artifacts + download as attachment)
-- Multiple output formats (at least JSON; optionally CSV/JSONL later)
-- Pluggable transformations (processors) and final serialization (formatters)
-- Future storage targets (local filesystem today; Blob later)
-
-Additional requirement (user-provided):
-
-- The export endpoint (pipeline-based export) should support multiple backends via an interface/adapter layer.
-- The initial concrete storage backend should point to Azure Blob Storage.
-
-## Verified repo findings (current state)
-
-### Existing export behaviors
-
-- There is a snapshot export write path:
-  - Endpoint: `POST /v1/ground-truths/snapshot`
-  - Implementation: `SnapshotService.export_json()` writes per-item JSON files and a `manifest.json` under `./exports/snapshots/{ts}/`.
-  - Source: `backend/app/services/snapshot_service.py`
-
-Minimal excerpt (write artifacts):
-
-```python
-async def export_json(self) -> dict[str, str | int]:
-    ts = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
-    out_dir = self.base_dir / ts
-    out_dir.mkdir(parents=True, exist_ok=True)
-```
-
-- There is a snapshot download/read path:
-  - Endpoint: `GET /v1/ground-truths/snapshot`
-  - Implementation: builds an in-memory payload containing `{ schemaVersion, snapshotAt, datasetNames, count, filters, items }` and returns it as a JSON attachment (`Content-Disposition`).
-  - Source: `backend/app/api/v1/ground_truths.py`
-
-Minimal excerpt (attachment header):
-
-```python
-return JSONResponse(
-    content=payload,
-    media_type="application/json",
-    headers={"Content-Disposition": f'attachment; filename="{filename}"'},
-)
-```
-
-- The frontend expects `Content-Disposition` for snapshot downloads and derives a filename from it.
-  - Source: `frontend/src/services/groundTruths.ts`
-
-Minimal excerpt (derives filename from header):
-
-```ts
-const cd = res.headers.get("Content-Disposition") || res.headers.get("content-disposition") || "";
-const match = cd.match(/filename\*?=(?:UTF-8''|")?([^";]+)"?/i);
-```
-
-### Existing “storage adapter” building blocks
-
-- A `SnapshotStorage` protocol exists with `write_json(path, obj)`.
-  - Source: `backend/app/adapters/storage/base.py`
-
-- A local filesystem implementation exists.
-  - Source: `backend/app/adapters/storage/local_fs.py`
-
-- The current `SnapshotService` bypasses the `SnapshotStorage` abstraction and writes directly via `pathlib.Path`.
-  - Source: `backend/app/services/snapshot_service.py`
-
-### Azure Blob readiness (verified)
-
-- The backend configuration currently does not define any Blob-related settings (no `BLOB_*` fields).
-  - Source: `backend/app/core/config.py`
-
-- The backend dependency set currently does not include `azure-storage-blob` in `pyproject.toml`.
-  - Source: `backend/pyproject.toml`
-
-- The backend already includes `azure-identity` as a dependency, which can be used to authenticate to Azure Blob via `DefaultAzureCredential`.
-  - Source: `backend/pyproject.toml`
-
-- Project documentation anticipates Azure Blob support (account URL + container) and a future adapter module.
-  - Source: `backend/docs/fastapi-implementation-plan.md`
-
-### Existing docs influencing export
-
-- `docs/computed-tags-design.md` proposes an export processor / formatter pipeline:
-  - Processors: list-in/list-out transforms (merge tags, anonymize, split/explode, etc.)
-  - Formatters: final conversion to bytes/string (CSV, JSON)
-  - Configuration: env var ordering (e.g., `EXPORT_PROCESSOR_ORDER`)
-
-- `docs/json-export-migration-plan.md` documents a prior decision to move away from JSONL assumptions and keep snapshot artifacts as JSON.
-
-## Current code patterns that the export pipeline should align with
-
-### Serialization conventions
-
-- The backend consistently uses Pydantic v2 with `model_dump(mode="json", by_alias=True, exclude_none=True)`.
-  - Example usage in `SnapshotService.build_snapshot_payload()` and `SnapshotService.export_json()`.
-
-### Container wiring
-
-- Services are constructed in `backend/app/container.py` and injected through a singleton `container` referenced by routers.
-  - Snapshot route calls `container.snapshot_service.*`.
-
-### Computed tags and export
-
-- Computed tags are applied on write paths via `apply_computed_tags()`.
-  - This suggests exports should be explicit about whether they export:
-    - Raw stored fields (`manualTags`, `computedTags`), or
-    - A merged/derived field (`tags`) for downstream consumer compatibility.
-
-## Gaps / constraints
-
-- There is no generalized export endpoint or service beyond “snapshot”; exports are coupled to approved items only.
-- There is no generic way to chain transformations (processors) or select output formats.
-- The storage abstraction exists but is not currently used by `SnapshotService`.
-- The snapshot download path currently builds the full payload in memory; for large exports, streaming (or generating an artifact and returning it) may be preferable.
-
-Additional gaps (for Blob-first implementation):
-
-- No Azure Blob adapter implementation exists in `backend/app/` today.
-- Settings are strict (`extra="forbid"`), so Blob env vars must be explicitly added to `Settings` before they can be used.
-- `azure-storage-blob` must be added as a runtime dependency before implementing the adapter.
-
-## Proposed export pipeline architecture (evidence-based)
-
-This design combines:
-
-- The plugin-based processor/formatter approach from `docs/computed-tags-design.md`
-- The concrete snapshot behaviors already implemented (`SnapshotService`)
-- Standard FastAPI patterns for file downloads and streaming
-
-### Core concepts
-
-- **ExportJob input**: filters (dataset/status/tags), format selection, and processor list.
-- **ExportRecord**: a dict-like representation (or a strongly typed DTO) produced from `GroundTruthItem.model_dump(..., by_alias=True)`.
-- **ExportProcessor**: `List[dict] -> List[dict]` transformations.
-- **ExportFormatter**: `List[dict] -> bytes | str` final serialization.
-- **ExportTarget/Storage**: writes artifacts (local fs today; Blob later).
-
-### API surface recommendations
-
-- Keep the existing snapshot routes stable for backward compatibility.
-- Add a new export endpoint that makes the pipeline explicit, e.g.:
-  - `GET /v1/exports/ground-truths?format=json&dataset=...&status=approved&processors=merge_tags,anonymize`
-  - or `POST /v1/exports/ground-truths` with a request body defining filters and options.
-
-### Streaming / large payload guidance
-
-FastAPI supports returning file-like responses without buffering whole payloads.
-
-- `FileResponse` can stream a generated artifact and sets `Content-Disposition` when `filename=` is provided.
-- `StreamingResponse` can stream bytes from a generator if you want to avoid writing to disk first.
-
-External reference:
-- FastAPI “Custom Response - HTML, Stream, File, others” (`StreamingResponse`, `FileResponse`):
-  - https://fastapi.tiangolo.com/advanced/custom-response/
-
-Verified examples from the FastAPI docs (high level):
-
-- `StreamingResponse(generator(), media_type=...)` for streaming bytes from an iterator/generator.
-- `FileResponse(path, filename=...)` for sending a file with `Content-Disposition`.
-
-## Compatibility and evolution plan
-
-- Phase 1: implement processors/formatters and keep output JSON-compatible with current snapshot payload and/or current per-item JSON artifacts.
-- Phase 2: integrate a generalized export storage interface with Azure Blob as the initial concrete implementation (optionally keep local filesystem for dev/test).
-- Phase 3: optional support for asynchronous/batched exports for very large datasets (queue + polling).
-
-## Concrete implementation guidance (what we should standardize)
-
-- Naming conventions:
-  - Processor names and formatter names should be lowercase and stable (e.g., `merge_tags`, `anonymize`, `json_items`, `json_snapshot_payload`).
-
-- Deterministic output (testability):
-  - Prefer stable key ordering for manifests where it matters; otherwise rely on JSON comparison via parsed objects.
-  - Avoid non-deterministic timestamps in unit tests by injecting a clock or allowing `snapshotAt` override.
-
-- Cosmos query considerations (future):
-  - Filter by dataset/bucket when possible to avoid cross-partition scans.
-  - If exports become “all datasets”, make it explicit and guarded.
-
-## Files most relevant to this task
-
-- `backend/app/services/snapshot_service.py`
-- `backend/app/api/v1/ground_truths.py`
-- `backend/app/adapters/storage/base.py`
-- `backend/app/adapters/storage/local_fs.py`
-- `docs/computed-tags-design.md`
-- `docs/json-export-migration-plan.md`
-- `frontend/src/services/groundTruths.ts`
-
-## External references: Azure Blob Storage (Python SDK)
-
-These references support the Blob-first storage backend plan and provide verified SDK/auth patterns.
-
-- Azure Storage Blobs client library for Python (overview + credential options + async notes):
-  - https://learn.microsoft.com/en-us/python/api/overview/azure/storage-blob-readme
-
-- Quickstart (managed identity / `DefaultAzureCredential` example):
-  - https://learn.microsoft.com/en-us/azure/storage/blobs/storage-quickstart-blobs-python?tabs=managed-identity%2Cazure-portal
-
-Key takeaways for this repo’s planned adapter:
-
-- Client creation supports AAD token credentials (e.g., `DefaultAzureCredential`) with `account_url`, which aligns with using Managed Identity in production.
-- The SDK provides async clients under `azure.storage.blob.aio`, but requires an async transport (commonly `aiohttp`) to be installed.
-- A Blob-first delivery option can be implemented either by proxying downloads via the backend (preserve `Content-Disposition`) or by returning a SAS URL (client downloads directly).
-
diff --git a/.copilot-tracking/research/20260116-export-pipeline-implementation-research.md b/.copilot-tracking/research/20260116-export-pipeline-implementation-research.md
deleted file mode 100644
index 706e3e3..0000000
--- a/.copilot-tracking/research/20260116-export-pipeline-implementation-research.md
+++ /dev/null
@@ -1,166 +0,0 @@
----
-description: Research findings to support implementing the export pipeline design in code (backend-first)
-ms.date: 2026-01-16
----
-<!-- markdownlint-disable-file -->
-
-# Research: Export Pipeline Implementation
-
-## Tooling notes (how findings were verified)
-
-- Workspace search: `file_search`, `grep_search`, and `semantic_search` were used to locate snapshot/export routes, services, storage adapters, and tests.
-- File inspection: `read_file` was used to confirm current endpoint behavior and service implementations.
-- External references: `fetch_webpage` was used to pull verified guidance for FastAPI `FileResponse`/`StreamingResponse` and Azure Blob Storage Python SDK usage patterns.
-
-## Scope
-
-Implement the export pipeline architecture described in `docs/computed-tags-design.md` (Section 4.4) into the backend codebase while preserving existing snapshot endpoint behaviors.
-
-Out of scope for the first milestone (unless needed for compatibility):
-
-- New export endpoints beyond the existing snapshot routes
-- Export job orchestration (async background jobs, polling endpoints)
-- Additional formats (CSV/JSONL/ZIP) beyond the initial JSON formats described in the design
-
-## Verified repo findings (current state)
-
-### Snapshot routes and stable behaviors
-
-Backend snapshot endpoints exist and are currently relied upon by tests and the frontend:
-
-- `POST /v1/ground-truths/snapshot`
-  - Implementation: calls `SnapshotService.export_json()`
-  - Writes per-item JSON artifacts and a `manifest.json` under `exports/snapshots/{ts}/`
-  - Source: `backend/app/api/v1/ground_truths.py`, `backend/app/services/snapshot_service.py`
-
-- `GET /v1/ground-truths/snapshot`
-  - Implementation: returns an `application/json` payload with `Content-Disposition: attachment; filename="ground-truth-snapshot-<ts>.json"`
-  - Payload shape includes `schemaVersion`, `snapshotAt`, `datasetNames`, `count`, `filters`, `items`
-  - Source: `backend/app/api/v1/ground_truths.py`, `backend/app/services/snapshot_service.py`
-
-These behaviors are verified by tests:
-
-- Artifact write verification: `backend/tests/integration/test_snapshot_artifacts_cosmos.py`
-- Download endpoint behavior: `backend/tests/integration/ground_truths/test_snapshot_download_endpoint.py`
-- Payload shape/unit behavior: `backend/tests/unit/test_snapshot_service.py`
-
-### Frontend coupling
-
-Frontend snapshot download depends on the backend providing `Content-Disposition` for a filename.
-
-- Source: `frontend/src/services/groundTruths.ts` (parses `Content-Disposition` to derive filename)
-
-### Existing storage abstraction (partial)
-
-There is a small storage protocol already:
-
-- `SnapshotStorage` protocol with `write_json(path, obj)`
-  - Source: `backend/app/adapters/storage/base.py`
-- `LocalFilesystemStorage` implementation
-  - Source: `backend/app/adapters/storage/local_fs.py`
-
-However, current snapshot code writes directly to disk via `pathlib.Path` and does not use the storage protocol.
-
-### Dependency and configuration constraints
-
-- Backend settings enforce `extra="forbid"`, so any new env vars must be explicitly added.
-  - Source: `backend/app/core/config.py`
-- Backend dependencies include `azure-identity` but do not include `azure-storage-blob`.
-  - Source: `backend/pyproject.toml`
-
-### Existing "registry" patterns to follow
-
-Computed tags are implemented with:
-
-- Interface + registry (`ComputedTagPlugin`, `TagPluginRegistry`)
-- Auto-discovery of plugin implementations via module scanning
-
-Sources:
-
-- `backend/app/plugins/base.py`
-- `backend/app/plugins/registry.py`
-
-This is a good local precedent for the export processor/formatter registries.
-
-## Design requirements to implement (source of truth)
-
-The implementation should follow `docs/computed-tags-design.md` Section 4.4, including:
-
-- Processor and formatter interfaces:
-  - `ExportProcessor`: list-in/list-out deterministic transforms
-  - `ExportFormatter`: list-in -> `bytes|str` serialization
-- Registries:
-  - Resolve processors and formatters by stable names
-  - Reject duplicates
-  - Unknown names produce a clear 400 error at the API
-- Configuration:
-  - `GTC_EXPORT_PROCESSOR_ORDER` controls default processor order
-- Initial pipeline features:
-  - Processor: `merge_tags` (derive `tags = union(manualTags, computedTags)`)
-  - Formatters: `json_snapshot_payload`, `json_items`
-- Storage interface:
-  - Multi-backend export storage with `local` default and `blob` as initial cloud backend
-  - Stable artifact key layout: `exports/snapshots/{timestamp}/{filename}`
-- Delivery modes:
-  - `attachment`, `artifact`, `stream` (backend sets `Content-Disposition`)
-- Compatibility rule:
-  - Snapshot endpoints must remain backward compatible (payload keys and behavior expectations)
-
-## Implementation mapping (repo-aligned)
-
-### Recommended package layout (new)
-
-Create a backend package (example):
-
-- `backend/app/exports/`
-  - `models.py` (request DTOs, export record type aliases)
-  - `processors/` (merge_tags)
-  - `formatters/` (json_snapshot_payload, json_items)
-  - `registry.py` (processor/formatter registries)
-  - `storage/` (local + blob backends)
-  - `pipeline.py` (execution flow: load -> process -> format -> deliver)
-
-### Router/service integration
-
-- Keep router thin in `backend/app/api/v1/ground_truths.py`.
-- Wire pipeline services through the singleton `container` in `backend/app/container.py`, similar to other services.
-- Update `SnapshotService` to delegate to the pipeline for:
-  - building the snapshot payload
-  - writing artifacts
-
-## External references (verified)
-
-### FastAPI response types
-
-FastAPI (Starlette) supports streaming and file responses for download behavior.
-
-- `StreamingResponse` can stream from an iterator/generator or async generator.
-- `FileResponse` can stream a local file and can set `Content-Disposition` using `filename=...`.
-
-Source: https://fastapi.tiangolo.com/advanced/custom-response/
-
-### Azure Blob Storage SDK (Python)
-
-- `azure-storage-blob` is required for Blob operations; `azure-identity` provides `DefaultAzureCredential`.
-- Async clients exist under `azure.storage.blob.aio` and are intended for use with `asyncio`.
-
-Sources:
-
-- https://learn.microsoft.com/en-us/azure/storage/blobs/storage-quickstart-blobs-python
-- https://learn.microsoft.com/en-us/azure/developer/python/sdk/azure-sdk-library-usage-patterns#async
-
-Operational note for local dev:
-
-- Developers typically need the Storage Blob Data Contributor role for their identity to read/write blobs in a dev container.
-
-## Risks and compatibility traps
-
-- Changing the default behavior of `POST /v1/ground-truths/snapshot` could break integration tests (and any external automation).
-- Changing `GET /v1/ground-truths/snapshot` headers or payload keys will break frontend download logic and snapshot tests.
-- Introducing new env vars without updating `Settings` will fail app startup due to `extra="forbid"`.
-
-## Suggested verification approach
-
-- Run unit tests for pipeline components (registries, processors, formatters).
-- Re-run existing snapshot integration tests to ensure the snapshot endpoints remain compatible.
-- Ensure OpenAPI generation remains consistent if request models change (frontend uses generated client types).
diff --git a/.copilot-tracking/research/20260116-manual-tags-design-research.md b/.copilot-tracking/research/20260116-manual-tags-design-research.md
deleted file mode 100644
index 2ec4c17..0000000
--- a/.copilot-tracking/research/20260116-manual-tags-design-research.md
+++ /dev/null
@@ -1,169 +0,0 @@
----
-title: Manual Tags Design Research
-description: Verified findings and references for implementing manual-tags design in GroundTruthCurator
-ms.date: 2026-01-16
----
-<!-- markdownlint-disable-file -->
-
-## Scope
-
-This research covers the current and intended design for **manual tags** in Ground Truth Curator, including:
-
-* Storage shape and validation rules
-* Manual tag discovery (schema + registry + optional allowlist)
-* API surface consumed by the frontend
-* Cosmos DB persistence model for global tag registry
-* Known interaction points with computed tags
-
-## Workspace reconnaissance (verified)
-
-### Tool usage (evidence collection)
-
-The findings below were collected using repository-wide searches and direct file inspection:
-
-* `grep_search` for `manualTags`, `computedTags`, `ALLOWED_MANUAL_TAGS`, `TagValidator`, and related symbols
-* `read_file` of the concrete implementations and tests listed below
-* `fetch_webpage` for the external FastAPI/Pydantic/Cosmos DB references
-
-### Key backend files
-
-* `backend/app/domain/models.py`
-  * `GroundTruthItem.manual_tags` stored as `manualTags`.
-  * `GroundTruthItem.computed_tags` stored as `computedTags`.
-  * `GroundTruthItem.tags` is a computed union for reads.
-
-* `backend/app/domain/validators.py`
-  * Pydantic v2 field validators coerce `manual_tags` and validate via `validate_tags()`.
-  * `computed_tags` are coerced only (no user validation).
-
-* `backend/app/services/tagging_service.py`
-  * Canonicalization rules (`normalize_tag`) enforce `group:value` format.
-  * `validate_tags()` enforces exclusivity/dependency rules for **known** groups.
-  * Unknown groups/values are allowed (format still required).
-  * `validate_tags_with_cache()` provides a stricter mode: manual tags must exist in a provided allow-set.
-
-* `backend/app/domain/tags.py`
-  * Defines `TAG_SCHEMA` for known groups and value sets.
-  * Defines rule plugins (`ExclusiveGroupRule`, `DependencyRule`) applied by `validate_tags()`.
-
-* `backend/app/api/v1/tags.py`
-  * `GET /v1/tags/schema` returns `TAG_SCHEMA` for frontend rendering and client-side validation.
-  * `GET /v1/tags` returns manual tags in `tags` plus computed tag keys in `computedTags`.
-  * When `GTC_ALLOWED_MANUAL_TAGS` is set, `GET /v1/tags` uses it as the manual-tag source-of-truth.
-
-* `backend/app/services/tag_registry_service.py`
-  * Implements add/remove/list over a single global tag list.
-
-* `backend/app/adapters/repos/tags_repo.py`
-  * Cosmos implementation stores a single document `id="tags|global"` in the tags container.
-  * Partition key `/pk` uses constant value `"global"`.
-
-* `backend/app/main.py`
-  * Startup fails fast if `GTC_ALLOWED_MANUAL_TAGS` overlaps static computed tag keys.
-
-### Key frontend files
-
-* `frontend/src/services/tags.ts`
-  * Fetches `GET /v1/tags/schema` and validates exclusive groups client-side.
-  * Fetches `GET /v1/tags` and uses `tags` as manual tags and `computedTags` as computed tags.
-
-### Tests and call sites demonstrating current behavior
-
-* `backend/tests/unit/test_groundtruthitem_tags_validation.py`
-  * Confirms unknown groups are allowed for `manualTags`.
-  * Confirms exclusive groups (e.g., `source:*`) reject multiple values.
-
-* `backend/app/services/validation_service.py`
-  * Bulk import validation uses `validate_tags_with_cache()` and the tag registry as the allow-set.
-
-## Current behavior summary (evidence-based)
-
-### Code excerpts (current patterns)
-
-Pydantic v2 validators on `manual_tags` enforce normalization + rule checks:
-
-```python
-@field_validator("manual_tags", mode="before")
-@classmethod
-def _coerce_manual_tags(_cls, v: Any) -> list[str]:
-  return coerce_tags(v)
-
-@field_validator("manual_tags", mode="after")
-@classmethod
-def _validate_manual_tags(_cls, v: list[str]) -> list[str]:
-  return validate_tags(v)
-```
-
-The tags API returns manual tags and computed tag keys separately, with an env override for manual tags:
-
-```python
-if settings.ALLOWED_MANUAL_TAGS:
-  manual_tags = [t.strip() for t in settings.ALLOWED_MANUAL_TAGS.split(",") if t and t.strip()]
-else:
-  manual_tags = await container.tag_registry_service.list_tags()
-
-computed_tag_keys = sorted(get_default_registry().get_static_keys())
-return TagListResponse(tags=sorted(manual_tags), computedTags=computed_tag_keys)
-```
-
-### Canonical format
-
-* Tags must be `group:value`.
-* Canonicalization lowercases, trims whitespace, normalizes `group : value` to `group:value`, and removes empty group/value.
-
-### Validation policy (two-tier)
-
-* **Default API/model validation (relaxed):**
-  * Accepts unknown groups and unknown values.
-  * Enforces exclusivity/dependency rules only for known groups in `TAG_SCHEMA`.
-
-* **Bulk import validation (strict allow-set):**
-  * Requires all manual tags to exist in the global tag registry set.
-  * Still enforces exclusivity/dependency rules.
-
-### Manual tag discovery sources
-
-Manual tags shown to the UI come from one of:
-
-* `GTC_ALLOWED_MANUAL_TAGS` (CSV) when set.
-* Otherwise, the global tag registry (`TagRegistryService` backed by memory or Cosmos).
-
-Known schema groups/values are also provided independently via `GET /v1/tags/schema`.
-
-### Global tag registry storage
-
-* Cosmos tags container stores a single global doc:
-  * `id = "tags|global"`
-  * `pk = "global"`
-  * `tags = ["group:value", ...]`
-
-This is intentionally simple and matches current API semantics (global tags, not per-dataset).
-
-## Gaps / decision points to resolve in the manual-tags “design”
-
-These are the key choices that affect implementation work:
-
-1. **Should runtime writes (PUT ground truths / assignments) be strict allow-set, or remain relaxed?**
-   * Current behavior is relaxed for normal writes, strict for bulk import.
-
-2. **What is the long-term source of truth for “allowed manual tags”?**
-   * Current options: env allowlist or global registry.
-   * A provider abstraction is partially implemented via the `GTC_ALLOWED_MANUAL_TAGS` override, but it is not expressed as a formal interface.
-
-3. **Do we need per-dataset or per-tenant tag registries?**
-   * Current registry is global.
-
-4. **How should manual tags interact with computed tags?**
-   * Startup checks prevent allowlist collisions with computed tags.
-   * Write path strips computed tags from manual tags during `apply_computed_tags()`.
-
-## External references (for implementation correctness)
-
-* Pydantic v2 validators (`field_validator`, before/after modes):
-  * <https://docs.pydantic.dev/latest/concepts/validators/>
-
-* FastAPI `response_model` behavior and filtering:
-  * <https://fastapi.tiangolo.com/tutorial/response-model/>
-
-* Cosmos DB partitioning and logical partition limits (relevant for global tags container design):
-  * <https://learn.microsoft.com/azure/cosmos-db/partitioning-overview>
diff --git a/.copilot-tracking/research/20260121-cosmos-repo-refactor-research.md b/.copilot-tracking/research/20260121-cosmos-repo-refactor-research.md
deleted file mode 100644
index 782256e..0000000
--- a/.copilot-tracking/research/20260121-cosmos-repo-refactor-research.md
+++ /dev/null
@@ -1,180 +0,0 @@
-<!-- markdownlint-disable-file -->
-# Task Research: Cosmos Repo / Service Layer Refactor
-
-Build refactoring research for:
-
-* Logic currently in `cosmos_repo.py` that should live in the service layer instead.
-* Logic currently in API routes/handlers that should live in the service layer instead.
-* A new `cosmos_emulator.py` that inherits from (or wraps) `cosmos_repo.py` and overrides emulator-specific behavior, instead of intermixing emulator conditionals inside `cosmos_repo.py`.
-
-## Task Implementation Requests
-
-* Identify and classify responsibilities currently in `cosmos_repo.py` (pure persistence vs domain/service logic vs emulator quirks).
-* Identify API-layer business logic candidates to move into services.
-* Propose a repo/service/emulator class/module structure, including the specific seams to override for the emulator.
-* Provide actionable refactor steps with exact file references (paths and line ranges).
-
-## Scope and Success Criteria
-
-* Scope:
-  * Backend Python code only.
-  * Focus on Cosmos DB repository + emulator behavior + API handlers.
-* Out of scope:
-  * Frontend changes.
-  * Large behavioral changes; this is a refactor plan.
-* Assumptions:
-  * There is an existing Cosmos repository abstraction used by services and API.
-  * Emulator-specific behavior is currently mixed into production Cosmos codepaths.
-* Success Criteria:
-  * A concrete, evidence-backed map of what to move and where.
-  * One recommended design for `cosmos_repo.py` + `cosmos_emulator.py` and service boundaries.
-  * Refactor steps that minimize risk and avoid breaking dependency injection.
-
-## Outline
-
-1. Convention discovery (repo-specific guidelines, layering conventions)
-2. Current-state inventory
-   * `cosmos_repo.py` responsibilities
-   * Emulator-specific branching points
-   * API endpoints containing business logic
-   * Service layer responsibilities today
-3. Target architecture
-   * Repository interface vs implementation
-   * Emulator-specific subclass/adapter
-   * Service boundaries and orchestration
-4. Migration plan
-   * Mechanical steps
-   * High-risk areas
-   * Suggested tests/verification steps
-
-## Research Executed
-
-### Project Conventions
-
-* Layering is documented as API → Services → Repos/Adapters, composed via a singleton container ([backend/CODEBASE.md](backend/CODEBASE.md#L20-L29)).
-* DI wiring follows a global `container` with an async `startup_cosmos()` initialization path ([backend/app/container.py](backend/app/container.py#L83-L161)).
-* There is existing, explicitly documented emulator/conditional behavior inside the Cosmos repo (e.g., the conditional patch implementation for `assign_to`) ([backend/CONDITIONAL_PATCH_IMPLEMENTATION.md](backend/CONDITIONAL_PATCH_IMPLEMENTATION.md#L11-L22), [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L1409-L1609)).
-* Emulator limitations are already recognized and sometimes require alternate query behavior (notably `ARRAY_CONTAINS`) ([backend/docs/cosmos-emulator-limitations.md](backend/docs/cosmos-emulator-limitations.md#L5-L36)).
-* Emulator Unicode/backslash issues are handled via a feature flag (base64 encoding of `refs[*].content`) ([backend/docs/cosmos-emulator-unicode-workaround.md](backend/docs/cosmos-emulator-unicode-workaround.md#L35-L39)).
-
-### File Analysis
-
-* Repository implementation:
-  * Cosmos repo: [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L389-L443)
-  * Repo interface/base: [backend/app/adapters/repos/base.py](backend/app/adapters/repos/base.py#L1-L55)
-* API endpoints with notable workflow logic:
-  * Assignments: [backend/app/api/v1/assignments.py](backend/app/api/v1/assignments.py#L78-L232)
-  * Ground truths: [backend/app/api/v1/ground_truths.py](backend/app/api/v1/ground_truths.py#L105-L154), [backend/app/api/v1/ground_truths.py](backend/app/api/v1/ground_truths.py#L232-L369)
-* Existing service boundary:
-  * Assignment service: [backend/app/services/assignment_service.py](backend/app/services/assignment_service.py#L44-L146)
-* DI wiring:
-  * Container composition: [backend/app/container.py](backend/app/container.py#L83-L161)
-
-### Code Search Results
-
-* Emulator/compat toggles and fallbacks exist in the repo and influence query shape and/or write behavior:
-  * Pagination/query logic and limitations: [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L660-L911)
-  * Unicode/emulator workarounds and retry behavior are present in the repo write-paths (see transform/retry regions): [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L500-L590)
-* Conditional patch vs read-modify-replace assignment semantics are implemented in the repo today:
-  * Assignment patch implementation: [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L1409-L1609)
-  * Assignment fallback path: [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L1649-L1680)
-
-## Research Inputs
-
-* Conventions: [.copilot-tracking/subagent/20260121/conventions-research.md](.copilot-tracking/subagent/20260121/conventions-research.md)
-* API hotspots: [.copilot-tracking/subagent/20260121/api-logic-research.md](.copilot-tracking/subagent/20260121/api-logic-research.md)
-* Cosmos repo deep dive: [.copilot-tracking/subagent/20260121/cosmos-repo-research.md](.copilot-tracking/subagent/20260121/cosmos-repo-research.md)
-* Consolidated synthesis: [.copilot-tracking/subagent/20260121/synthesis-notes.md](.copilot-tracking/subagent/20260121/synthesis-notes.md)
-
-## Key Discoveries
-
-### Project Structure
-
-* The backend already has an explicit `services/` layer, but some orchestration/workflow logic remains in routers and in the Cosmos repo.
-* The Cosmos repo currently contains both production Cosmos behavior and emulator compatibility behavior.
-
-### Implementation Patterns
-
-* API handlers perform multi-step update workflows (parse → read existing → compute changes → write → post-processing) that are better owned by services to keep business rules testable and reusable.
-* The repo includes conditional patch logic for assignments (optimized for Cosmos) that is known to be incompatible with emulator behavior; this is the clearest subclass override seam.
-
-### Emulator Split Findings
-
-The currently mixed emulator-specific behavior clusters into three themes:
-
-* Query limitations (emulator does not support some predicates/constructs): [backend/docs/cosmos-emulator-limitations.md](backend/docs/cosmos-emulator-limitations.md#L5-L36)
-* Write-path transforms to avoid Unicode/backslash issues: [backend/docs/cosmos-emulator-unicode-workaround.md](backend/docs/cosmos-emulator-unicode-workaround.md#L35-L39)
-* Assignment update semantics (patch vs read-modify-replace): [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L1409-L1609)
-
-## Technical Scenarios
-
-### Scenario: Split persistence vs service logic
-
-**Requirements:**
-
-* Keep persistence code (query building, paging, RU/diagnostics, container interactions) in repo.
-* Move domain decisions, validation, and orchestration to services.
-
-**Preferred Approach:**
-
-* Keep `cosmos_repo.py` as the production implementation of the existing repo interface.
-* Move workflow/domain decisions into services (thin repo; services orchestrate).
-* Add `cosmos_emulator.py` that subclasses the production repo and overrides only emulator-specific seams.
-
-Recommended override seams for `cosmos_emulator.py` (inheritance-based):
-
-* `is_cosmos_emulator_in_use()`
-* `list_gt_paginated(...)` (force emulator-safe filtering strategy) ([backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L660-L911))
-* `assign_to(...)` (force read-modify-replace; avoid patch predicates) ([backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L1409-L1609), [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L1649-L1680))
-* `upsert_gt(...)` / `delete_dataset(...)` (centralize emulator-specific retry policy)
-* `_transform_doc_for_write(...)` and `_transform_doc_for_read(...)` (unicode/base64 workaround seam)
-
-Target file tree (conceptual):
-
-```text
-backend/app/adapters/repos/
-  base.py
-  cosmos_repo.py          # production implementation
-  cosmos_emulator.py      # emulator implementation (subclass)
-```
-
-_TBD once we see the actual code structure._
-
-#### Considered Alternatives
-
-* Keep emulator conditionals in `cosmos_repo.py` with flags:
-  * Pros: fewer new files/classes.
-  * Cons: continued intermixing; harder to reason about production behavior and to test.
-* Strategy object injected into repo (instead of subclass):
-  * Pros: explicit seam without inheritance.
-  * Cons: more plumbing and indirection than needed if only a handful of methods differ.
-
-### Scenario: Move API logic to services
-
-**Requirements:**
-
-* API handlers should do: auth/identity extraction, request parsing/validation, response shaping.
-* Services should do: cross-entity workflows, domain decisions, idempotency semantics, event-ish side effects.
-
-Current hotspots (examples) where routers go beyond request parsing and response shaping:
-
-* Assignments workflow logic in the router: [backend/app/api/v1/assignments.py](backend/app/api/v1/assignments.py#L78-L232)
-* Ground truth update workflow logic in the router: [backend/app/api/v1/ground_truths.py](backend/app/api/v1/ground_truths.py#L232-L369)
-* Ground truth list/import validation and workflow logic: [backend/app/api/v1/ground_truths.py](backend/app/api/v1/ground_truths.py#L105-L154)
-
-Proposed service extraction:
-
-* Introduce a `GroundTruthUpdateService` responsible for the end-to-end update workflow used by multiple endpoints (read, validate, normalize, write, post-process).
-* Move assignment selection/sampling rules fully into the assignment service layer (building on [backend/app/services/assignment_service.py](backend/app/services/assignment_service.py#L44-L146)).
-
-## Recommended Migration Plan (Low-Risk)
-
-1) Introduce typed domain exceptions (stable API-level mapping).
-2) Add `GroundTruthUpdateService` with a single “update workflow” entrypoint.
-3) Switch routers to call the service (handlers become thin request/response adapters).
-4) Extract request parsing helpers into a shared module (router/service reuse).
-5) Move assignment sampling/selection logic out of the repo into services.
-6) Move derived-field computation (e.g., `totalReferences`) out of the repo into services/domain normalization.
-7) Add `cosmos_emulator.py` (subclass) and select it in the container wiring ([backend/app/container.py](backend/app/container.py#L83-L161)).
-8) Centralize document transforms behind the `_transform_doc_for_write` / `_transform_doc_for_read` seam.
-9) Update tests to target seams (behavior-preserving refactor first).
diff --git a/.copilot-tracking/research/20260121-high-level-requirements-research.md b/.copilot-tracking/research/20260121-high-level-requirements-research.md
deleted file mode 100644
index b1b5fc8..0000000
--- a/.copilot-tracking/research/20260121-high-level-requirements-research.md
+++ /dev/null
@@ -1,141 +0,0 @@
-<!-- markdownlint-disable-file -->
-# Task Research: High-Level Requirements Extraction (Frontend + Backend)
-
-Extract product/system requirements that match the *existing system* and keep them high-level (behavioral), avoiding implementation details. Cover both frontend and backend.
-
-## Task Implementation Requests
-
-* Extract high-level requirements already present in the repo (docs + PRD artifacts)
-* Ensure requirements reflect current frontend and backend capabilities
-* Avoid implementation-specific constraints (frameworks, file structure, concrete endpoints) unless required for behavior
-
-## Scope and Success Criteria
-
-* Scope: Requirements derived from existing repo artifacts (PRD JSON/TXT, README/CODEBASE docs, backend docs, frontend docs).
-* Exclusions: New feature ideation not supported by evidence in repo; low-level implementation steps.
-* Success Criteria:
-  * Requirements are grouped (Product, Frontend UX, Backend/API, Data/Storage, Export, Observability, Testing/Quality)
-  * Each requirement is backed by at least one repo source reference (file + line range)
-  * Requirements are written in “shall/should/may” language and are implementation-agnostic
-
-## Outline
-
-1. Evidence log (what was read)
-2. Consolidated requirement set
-3. Gaps/ambiguities where docs conflict
-4. Recommended next validation questions
-
-## Supporting Research
-
-Detailed extractions and audits used to build this document:
-
-- PRD extraction + match-to-system flags: [.copilot-tracking/subagent/20260121/prd-requirements-research.md](.copilot-tracking/subagent/20260121/prd-requirements-research.md)
-- Frontend capability extraction: [.copilot-tracking/subagent/20260121/frontend-requirements-research.md](.copilot-tracking/subagent/20260121/frontend-requirements-research.md)
-- Backend capability extraction: [.copilot-tracking/subagent/20260121/backend-requirements-research.md](.copilot-tracking/subagent/20260121/backend-requirements-research.md)
-- Repo conventions + sources-of-truth: [.copilot-tracking/subagent/20260121/conventions-and-sources-research.md](.copilot-tracking/subagent/20260121/conventions-and-sources-research.md)
-- Requirements synthesis working doc: [.copilot-tracking/subagent/20260121/consolidated-requirements-synthesis.md](.copilot-tracking/subagent/20260121/consolidated-requirements-synthesis.md)
-- Citation validation for this note: [.copilot-tracking/subagent/20260121/citation-validation.md](.copilot-tracking/subagent/20260121/citation-validation.md)
-- Reference audit (present vs linked): [.copilot-tracking/subagent/20260121/subagent-reference-audit.md](.copilot-tracking/subagent/20260121/subagent-reference-audit.md)
-
-### Potential Next Research
-
-* Identify which PRD items are intentionally deferred vs removed
-  * Reasoning: PRD contains capabilities not currently reflected in frontend/backend docs
-  * Reference: [prd.json](prd.json)
-
-## Research Executed
-
-### Evidence log (sources reviewed)
-- Primary requirements sources: [docs/ground-truth-curation-reqs.md](docs/ground-truth-curation-reqs.md), [prd.json](prd.json), [prd-genericize.json](prd-genericize.json), [ralph/ralph-prd.txt](ralph/ralph-prd.txt), [BUSINESS_VALUE.md](BUSINESS_VALUE.md)
-- Frontend behavior and UX invariants: [frontend/CODEBASE.md](frontend/CODEBASE.md#L70-L180), [frontend/README.md](frontend/README.md#L25-L92), [frontend/IMPLEMENTATION_SUMMARY.md](frontend/IMPLEMENTATION_SUMMARY.md#L84-L165)
-- Backend behavior and API semantics: [backend/CODEBASE.md](backend/CODEBASE.md#L14-L35), [backend/docs/api-change-checklist-assignments.md](backend/docs/api-change-checklist-assignments.md#L7-L113), [backend/docs/api-write-consolidation-plan.v2.md](backend/docs/api-write-consolidation-plan.v2.md#L62-L67)
-- Assignment workflow (single-item + materialized assignment doc): [backend/docs/assign-single-item-endpoint.md](backend/docs/assign-single-item-endpoint.md#L20-L95), [backend/app/adapters/repos/base.py](backend/app/adapters/repos/base.py#L47-L65)
-- Multi-turn backend compatibility: [backend/docs/multi-turn-refs.md](backend/docs/multi-turn-refs.md#L5-L75)
-- Export/snapshot behavior: [backend/docs/export-pipeline.md](backend/docs/export-pipeline.md#L24-L40), [backend/docs/export-pipeline.md](backend/docs/export-pipeline.md#L72-L93), [backend/docs/export-pipeline.md](backend/docs/export-pipeline.md#L117-L127)
-- Tag rules and normalization: [backend/docs/tagging_plan.md](backend/docs/tagging_plan.md#L5-L13), [backend/docs/tagging_plan.md](backend/docs/tagging_plan.md#L54-L61)
-- Cosmos emulator operational constraints + workarounds: [backend/app/main.py](backend/app/main.py#L60-L82), [backend/docs/cosmos-emulator-limitations.md](backend/docs/cosmos-emulator-limitations.md#L1-L25), [backend/docs/cosmos-emulator-unicode-workaround.md](backend/docs/cosmos-emulator-unicode-workaround.md#L35-L38)
-- Observability/telemetry expectations: [frontend/docs/OBSERVABILITY_IMPLEMENTATION.md](frontend/docs/OBSERVABILITY_IMPLEMENTATION.md#L13-L17), [frontend/docs/OBSERVABILITY_IMPLEMENTATION.md](frontend/docs/OBSERVABILITY_IMPLEMENTATION.md#L79-L86)
-- Dev user simulation header: [backend/README.md](backend/README.md#L336-L338), [frontend/README.md](frontend/README.md#L27-L32)
-
-### Research executed summary
-- Extracted behavioral requirements from primary requirement sources, current codebase docs (frontend + backend), and selected “contract” docs in the backend `docs/` directory.
-- Validated concurrency, assignment, and emulator constraints against code-level sources where available (repo protocol + app startup).
-- Identified doc conflicts where frontend requirements docs diverge from current implemented flows.
-
-## Consolidated Requirements
-
-### Product / User Goals
-- The system shall support an assignment-based curation workflow where users work from a queue of assigned items and can request more assignments (“self-serve”). [frontend/CODEBASE.md](frontend/CODEBASE.md#L124-L149)
-- The system should support explicitly assigning a specific item to oneself, including conflict protection when another user already holds a draft assignment. [backend/docs/assign-single-item-endpoint.md](backend/docs/assign-single-item-endpoint.md#L20-L39)
-- The system shall support both single-turn (Q/A) and multi-turn (conversation history) ground-truth editing while preserving backward compatibility for existing item shapes. [frontend/IMPLEMENTATION_SUMMARY.md](frontend/IMPLEMENTATION_SUMMARY.md#L104-L165), [backend/docs/multi-turn-refs.md](backend/docs/multi-turn-refs.md#L5-L75)
-
-### Frontend UX Requirements
-- The UI shall provide a single-page curation workspace with distinct queue, editor/actions, and references areas. [frontend/CODEBASE.md](frontend/CODEBASE.md#L79-L80)
-- The UI shall gate approval on reference completeness: at least one selected reference; all references visited; selected references include a key paragraph with minimum length (≥40 chars); deleted items cannot be approved. [frontend/CODEBASE.md](frontend/CODEBASE.md#L79-L79), [frontend/CODEBASE.md](frontend/CODEBASE.md#L119-L122), [frontend/src/components/app/defaultCurateInstructions.md](frontend/src/components/app/defaultCurateInstructions.md#L1-L4)
-- The UI shall support reference workflows including search, adding selected references, URL de-duplication, visited tracking (open-in-new-tab), and key-paragraph editing with a counter. [frontend/CODEBASE.md](frontend/CODEBASE.md#L141-L143), [frontend/CODEBASE.md](frontend/CODEBASE.md#L152-L165)
-- The UI should support removing a reference with an undo window and provide toast-based feedback for key actions and failures. [frontend/CODEBASE.md](frontend/CODEBASE.md#L136-L136), [frontend/CODEBASE.md](frontend/CODEBASE.md#L164-L165)
-- The UI shall support soft delete + restore semantics and prevent approval of deleted items. [frontend/CODEBASE.md](frontend/CODEBASE.md#L147-L147), [frontend/CODEBASE.md](frontend/CODEBASE.md#L79-L79)
-- The UI should detect no-op saves and report “No changes” rather than issuing an update that changes nothing. [frontend/CODEBASE.md](frontend/CODEBASE.md#L145-L145)
-- The UI shall support snapshot export by downloading a backend-provided JSON snapshot. [frontend/CODEBASE.md](frontend/CODEBASE.md#L146-L146)
-- The UI shall support multi-turn editing features (timeline, turn add/delete/edit, mode toggle), plus multi-turn approval constraints requiring reference relevance marking and key-paragraph constraints for “relevant” references. [frontend/IMPLEMENTATION_SUMMARY.md](frontend/IMPLEMENTATION_SUMMARY.md#L86-L151)
-- The UI should support a demo mode that disables or safely no-ops telemetry and can use mock providers. [frontend/README.md](frontend/README.md#L73-L92), [frontend/docs/OBSERVABILITY_IMPLEMENTATION.md](frontend/docs/OBSERVABILITY_IMPLEMENTATION.md#L13-L17)
-- The UI should support fetching and updating dataset-level curation instructions (with ETag-based concurrency control on update). [frontend/docs/MVP_REQUIREMENTS.md](frontend/docs/MVP_REQUIREMENTS.md#L15-L18)
-
-### Backend / API Requirements
-- The backend shall expose a health endpoint at `GET /healthz`. [backend/CODEBASE.md](backend/CODEBASE.md#L14-L15), [backend/app/main.py](backend/app/main.py#L147-L149)
-- The backend shall accept both snake_case and camelCase inputs and always emit camelCase outputs. [backend/CODEBASE.md](backend/CODEBASE.md#L31-L32)
-- The backend shall enforce optimistic concurrency on write paths using ETags: updates require `If-Match` (or equivalent request ETag) and return HTTP 412 on missing/mismatch with stable error semantics. [backend/CODEBASE.md](backend/CODEBASE.md#L33-L33), [backend/docs/api-change-checklist-assignments.md](backend/docs/api-change-checklist-assignments.md#L75-L113)
-- Assignment mutation endpoints shall enforce assignment ownership and return a stable ownership error when violated. [backend/docs/api-change-checklist-assignments.md](backend/docs/api-change-checklist-assignments.md#L82-L86)
-- Assignment state transitions (approve/skip/delete) shall clear assignment fields atomically with the status change, and assignment timestamps shall be timezone-aware UTC. [backend/docs/api-change-checklist-assignments.md](backend/docs/api-change-checklist-assignments.md#L7-L14), [backend/docs/api-change-checklist-assignments.md](backend/docs/api-change-checklist-assignments.md#L154-L156)
-- Assignment list responses shall include `etag` in the JSON body (even if per-item `ETag` headers are optional). [backend/docs/api-change-checklist-assignments.md](backend/docs/api-change-checklist-assignments.md#L31-L35)
-- The backend shall provide a single-item self-assign flow where assignment sets status to draft (even from approved/deleted/skipped) and rejects assignment of items draft-assigned to a different user. [backend/docs/assign-single-item-endpoint.md](backend/docs/assign-single-item-endpoint.md#L29-L39)
-- The backend should maintain a secondary assignment document (materialized view) keyed for fast per-user assignment queries. [backend/docs/assign-single-item-endpoint.md](backend/docs/assign-single-item-endpoint.md#L88-L95), [backend/app/adapters/repos/base.py](backend/app/adapters/repos/base.py#L55-L65)
-- Ground-truth item writes should be consolidated into SME PUT and Curator PUT flows (with import remaining create-only). [backend/docs/api-write-consolidation-plan.v2.md](backend/docs/api-write-consolidation-plan.v2.md#L62-L65)
-
-### Data & Storage Requirements
-- The backend shall abstract persistence behind a repository protocol to support multiple backends (Cosmos as production backend). [backend/app/adapters/repos/base.py](backend/app/adapters/repos/base.py#L17-L45), [backend/CODEBASE.md](backend/CODEBASE.md#L24-L30)
-- The backend shall support local development using the Cosmos DB Emulator and should not block startup if Cosmos initialization fails (e.g., emulator not ready). [backend/app/main.py](backend/app/main.py#L60-L82)
-- The system shall account for Cosmos DB Emulator query limitations (e.g., lack of `ARRAY_CONTAINS`) by adjusting behavior and/or skipping incompatible tests. [backend/docs/cosmos-emulator-limitations.md](backend/docs/cosmos-emulator-limitations.md#L5-L25)
-- The system may support a Cosmos emulator-specific Unicode escape workaround when configured (to avoid emulator-only invalid escape failures). [backend/docs/cosmos-emulator-unicode-workaround.md](backend/docs/cosmos-emulator-unicode-workaround.md#L35-L38)
-
-### Export / Snapshot Requirements
-- The backend shall support snapshot export in `attachment` (single JSON) and `artifact` (per-item JSON + manifest) modes with defined defaults when no request body is provided. [backend/docs/export-pipeline.md](backend/docs/export-pipeline.md#L24-L33)
-- The snapshot download endpoint shall return a JSON document payload (not artifacts). [backend/docs/export-pipeline.md](backend/docs/export-pipeline.md#L34-L40)
-- Artifact exports shall include a manifest with a stable `schemaVersion` and related snapshot metadata. [backend/docs/export-pipeline.md](backend/docs/export-pipeline.md#L81-L93)
-- Export processors shall run before formatting and may merge tag fields into a single exported `tags` array. [backend/docs/export-pipeline.md](backend/docs/export-pipeline.md#L117-L127)
-
-### Observability & Operations Requirements
-- Client telemetry shall be opt-in, disabled by default, and safe-by-default (no-op in demo mode or when configuration is missing). [frontend/docs/OBSERVABILITY_IMPLEMENTATION.md](frontend/docs/OBSERVABILITY_IMPLEMENTATION.md#L13-L17), [frontend/README.md](frontend/README.md#L82-L92)
-- The UI shall provide an error boundary that catches rendering errors and renders a user-friendly fallback (and may integrate with telemetry when enabled). [frontend/docs/OBSERVABILITY_IMPLEMENTATION.md](frontend/docs/OBSERVABILITY_IMPLEMENTATION.md#L79-L86)
-
-### Security & Privacy Requirements
-- In development, the system should support user simulation via an `X-User-Id` header to drive per-user assignment behavior and testing. [backend/README.md](backend/README.md#L336-L338), [frontend/README.md](frontend/README.md#L27-L32)
-
-### Quality / Testing Requirements
-- Tag normalization should be deterministic (normalize + deduplicate + sort) to ensure stable storage and comparisons. [backend/docs/tagging_plan.md](backend/docs/tagging_plan.md#L54-L57)
-- Emulator-incompatible tests (or behaviors) should be gated or skipped to avoid false failures in local/emulator workflows. [backend/docs/cosmos-emulator-limitations.md](backend/docs/cosmos-emulator-limitations.md#L9-L12)
-
-## Gaps and Conflicts
-
-- Reference search capability conflicts across frontend docs: MVP doc claims no backend search API endpoint, while the codebase guide describes a backend `searchReferences` flow used by the UI. [frontend/docs/MVP_REQUIREMENTS.md](frontend/docs/MVP_REQUIREMENTS.md#L27-L31), [frontend/CODEBASE.md](frontend/CODEBASE.md#L141-L142)
-- Tag semantics/validation scope is ambiguous between “canonical `group:value` tags” and per-history optional tags (unclear whether the same normalization/validation rules apply). [backend/docs/tagging_plan.md](backend/docs/tagging_plan.md#L5-L13), [backend/docs/history-tags-feature.md](backend/docs/history-tags-feature.md#L3-L6)
-- Tag registry write expectations conflict: MVP doc states “allow the user to create new tags” while also stating “no write endpoints for tags.” [frontend/docs/MVP_REQUIREMENTS.md](frontend/docs/MVP_REQUIREMENTS.md#L22-L24)
-- Cosmos emulator Unicode workaround coverage may be inconsistent: workaround doc claims it is applied to tag storage, but the tags repo upsert path shown does not indicate any special encoding/sanitization. [backend/COSMOS_EMULATOR_UNICODE_WORKAROUND.md](backend/COSMOS_EMULATOR_UNICODE_WORKAROUND.md#L111-L123), [backend/app/adapters/repos/tags_repo.py](backend/app/adapters/repos/tags_repo.py#L131-L141)
-
-## Next Validation Questions
-
-- Should reference search be treated as a required capability (backend API exists/should exist), or is it optional/stubbed for now? [frontend/docs/MVP_REQUIREMENTS.md](frontend/docs/MVP_REQUIREMENTS.md#L27-L31), [frontend/CODEBASE.md](frontend/CODEBASE.md#L141-L142)
-- For tags: are users allowed to create new tags end-to-end, and if so, what is the intended write path (if “no write endpoints” remains true)? [frontend/docs/MVP_REQUIREMENTS.md](frontend/docs/MVP_REQUIREMENTS.md#L22-L24), [backend/app/adapters/repos/tags_repo.py](backend/app/adapters/repos/tags_repo.py#L131-L154)
-- For multi-turn: is backend persistence expected to include reference relevance fields (relevant/neutral/irrelevant), or is that currently frontend-only state? [frontend/IMPLEMENTATION_SUMMARY.md](frontend/IMPLEMENTATION_SUMMARY.md#L93-L151)
-- For assignments: confirm intended semantics for listing “my assignments” (draft-only vs broader statuses) and how single-item assignment should interact with those semantics. [backend/CODEBASE.md](backend/CODEBASE.md#L154-L156), [backend/docs/assign-single-item-endpoint.md](backend/docs/assign-single-item-endpoint.md#L29-L39)
-- For Cosmos emulator Unicode handling: should tag registry writes also apply the configured workaround (as docs imply), or should the docs be updated to reflect current behavior? [backend/COSMOS_EMULATOR_UNICODE_WORKAROUND.md](backend/COSMOS_EMULATOR_UNICODE_WORKAROUND.md#L111-L123), [backend/app/adapters/repos/tags_repo.py](backend/app/adapters/repos/tags_repo.py#L131-L141)
-
-## PRD Items Not Yet Supported (Tracked Separately)
-
-> These appear in PRD artifacts but are not clearly supported by the existing frontend/backend system behaviors today.
-
-- AI-powered reference retrieval (attach/detach, query orchestration) and LLM-generated artifacts. Source PRD artifacts: [prd.json](prd.json), [ralph/ralph-prd.txt](ralph/ralph-prd.txt)
-- Dedicated tag administration endpoints/UI beyond current normalization + selection behaviors. Source PRD artifacts: [prd.json](prd.json)
-- Full auth/RBAC integration (e.g., Entra) beyond the dev `X-User-Id` simulation mechanism. Source PRD artifacts: [prd.json](prd.json)
-
-For a fuller breakdown (with evidence + “matches existing system” flags), see: [.copilot-tracking/subagent/20260121/prd-requirements-research.md](.copilot-tracking/subagent/20260121/prd-requirements-research.md)
diff --git a/.copilot-tracking/spec-sessions/jtbd-001-current-state.state.json b/.copilot-tracking/spec-sessions/jtbd-001-current-state.state.json
deleted file mode 100644
index 15ae312..0000000
--- a/.copilot-tracking/spec-sessions/jtbd-001-current-state.state.json
+++ /dev/null
@@ -1,72 +0,0 @@
-{
-  "jtbdId": "JTBD-001",
-  "jtbdStatement": "Help curators review and approve ground-truth data items through an assignment-based workflow",
-  "lastAccessed": "2026-01-22T00:00:00Z",
-  "currentPhase": "handoff",
-  "completedPhases": [
-    "jtbd-discovery",
-    "topic-decomposition",
-    "topic-research",
-    "spec-generation"
-  ],
-  "topics": [
-    {
-      "name": "assignment-workflow",
-      "description": "The assignment workflow manages how curators receive, claim, and complete work items.",
-      "scopeValidated": true,
-      "researchFile": ".copilot-tracking/subagent/20260122/assignment-workflow-research.md",
-      "specFile": "specs/assignment-workflow.md",
-      "status": "complete"
-    },
-    {
-      "name": "explorer-view",
-      "description": "The explorer view allows curators to browse and filter ground-truth items outside the assigned queue.",
-      "scopeValidated": true,
-      "researchFile": ".copilot-tracking/subagent/20260122/explorer-view-research.md",
-      "specFile": "specs/explorer-view.md",
-      "status": "complete"
-    },
-    {
-      "name": "curation-editor",
-      "description": "The curation editor enables viewing and editing ground-truth content, including tags.",
-      "scopeValidated": true,
-      "researchFile": ".copilot-tracking/subagent/20260122/curation-editor-research.md",
-      "specFile": "specs/curation-editor.md",
-      "status": "complete"
-    },
-    {
-      "name": "reference-management",
-      "description": "The reference management system supports adding, visiting, and annotating supporting sources.",
-      "scopeValidated": true,
-      "researchFile": ".copilot-tracking/subagent/20260122/reference-management-research.md",
-      "specFile": "specs/reference-management.md",
-      "status": "complete"
-    },
-    {
-      "name": "export-snapshots",
-      "description": "The export system generates downloadable JSON snapshots of curated data.",
-      "scopeValidated": true,
-      "researchFile": ".copilot-tracking/subagent/20260122/export-snapshots-research.md",
-      "specFile": "specs/export-snapshots.md",
-      "status": "complete"
-    },
-    {
-      "name": "data-persistence",
-      "description": "The persistence layer abstracts storage behind repositories with Cosmos DB as the primary backend.",
-      "scopeValidated": true,
-      "researchFile": ".copilot-tracking/subagent/20260122/data-persistence-research.md",
-      "specFile": "specs/data-persistence.md",
-      "status": "complete"
-    },
-    {
-      "name": "observability-operations",
-      "description": "The observability and operations system provides opt-in telemetry, error handling, and health status.",
-      "scopeValidated": true,
-      "researchFile": ".copilot-tracking/subagent/20260122/observability-operations-research.md",
-      "specFile": "specs/observability-operations.md",
-      "status": "complete"
-    }
-  ],
-  "openQuestions": [],
-  "nextActions": []
-}
diff --git a/.copilot-tracking/spec-sessions/jtbd-002-sme-curation.state.json b/.copilot-tracking/spec-sessions/jtbd-002-sme-curation.state.json
deleted file mode 100644
index bfc1b0f..0000000
--- a/.copilot-tracking/spec-sessions/jtbd-002-sme-curation.state.json
+++ /dev/null
@@ -1,82 +0,0 @@
-{
-  "jtbdId": "JTBD-002",
-  "jtbdStatement": "Help SMEs curate ground truth items effectively (enhancements)",
-  "lastAccessed": "2026-01-22T00:00:00Z",
-  "currentPhase": "spec-generation",
-  "completedPhases": [
-    "jtbd-discovery",
-    "topic-decomposition",
-    "topic-research"
-  ],
-  "topics": [
-    {
-      "name": "assignment-error-feedback",
-      "description": "The assignment error feedback system displays specific, actionable messages when assignment operations fail due to conflicts.",
-      "scopeValidated": true,
-      "researchFile": ".copilot-tracking/subagent/20260122/assignment-error-feedback-research.md",
-      "specFile": "specs/assignment-error-feedback.md",
-      "status": "draft"
-    },
-    {
-      "name": "assignment-takeover",
-      "description": "The assignment takeover system allows SMEs to reassign items currently assigned to others with appropriate confirmation.",
-      "scopeValidated": true,
-      "researchFile": ".copilot-tracking/subagent/20260122/assignment-takeover-research.md",
-      "specFile": "specs/assignment-takeover.md",
-      "status": "draft"
-    },
-    {
-      "name": "explorer-state-preservation",
-      "description": "The explorer state preservation system maintains filter and view state when users perform actions that navigate away.",
-      "scopeValidated": true,
-      "researchFile": ".copilot-tracking/subagent/20260122/explorer-state-preservation-research.md",
-      "specFile": "specs/explorer-state-preservation.md",
-      "status": "draft"
-    },
-    {
-      "name": "draft-duplicate-detection",
-      "description": "The draft duplicate detection system warns SMEs when draft items appear to duplicate approved items.",
-      "scopeValidated": true,
-      "researchFile": ".copilot-tracking/subagent/20260122/draft-duplicate-detection-research.md",
-      "specFile": "specs/draft-duplicate-detection.md",
-      "status": "draft"
-    },
-    {
-      "name": "modal-keyboard-handling",
-      "description": "The modal keyboard handling system ensures keyboard shortcuts do not unexpectedly close or interfere with modal interactions.",
-      "scopeValidated": true,
-      "researchFile": ".copilot-tracking/subagent/20260122/modal-keyboard-handling-research.md",
-      "specFile": "specs/modal-keyboard-handling.md",
-      "status": "draft"
-    },
-    {
-      "name": "validation-error-clarity",
-      "description": "The validation error clarity system translates backend validation errors into user-friendly messages with remediation guidance.",
-      "scopeValidated": true,
-      "researchFile": ".copilot-tracking/subagent/20260122/validation-error-clarity-research.md",
-      "specFile": "specs/validation-error-clarity.md",
-      "status": "draft"
-    },
-    {
-      "name": "inspection-performance",
-      "description": "The inspection performance system caches and memoizes data to improve responsiveness when viewing ground truth items.",
-      "scopeValidated": true,
-      "researchFile": ".copilot-tracking/subagent/20260122/inspection-performance-research.md",
-      "specFile": "specs/inspection-performance.md",
-      "status": "draft"
-    }
-  ],
-  "openQuestions": [
-    "What UI surface should the View assignee action open by default?",
-    "Which Explorer filters and UI controls should be included in URL state for v1?",
-    "What is the first duplicate matching heuristic for v1?",
-    "Which non-Escape keys should be stopped from propagating for all modals?",
-    "Which field name is canonical for the key paragraph in the API?",
-    "What TTL is appropriate for the inspect cache if we cannot reliably invalidate on all edits?"
-  ],
-  "nextActions": [
-    "Review specs for accuracy against current codebase",
-    "Resolve open questions and update specs",
-    "Hand off to Planning mode to generate an implementation plan"
-  ]
-}
diff --git a/.copilot-tracking/spec-sessions/jtbd-003-find-filter.state.json b/.copilot-tracking/spec-sessions/jtbd-003-find-filter.state.json
deleted file mode 100644
index ddf0dfc..0000000
--- a/.copilot-tracking/spec-sessions/jtbd-003-find-filter.state.json
+++ /dev/null
@@ -1,52 +0,0 @@
-{
-  "jtbdId": "JTBD-003",
-  "jtbdStatement": "Help users find and filter ground truth items (enhancements)",
-  "lastAccessed": "2026-01-22T12:00:00Z",
-  "currentPhase": "handoff",
-  "completedPhases": [
-    "jtbd-discovery",
-    "topic-decomposition",
-    "topic-research",
-    "spec-generation"
-  ],
-  "topics": [
-    {
-      "name": "keyword-search",
-      "description": "The keyword search system enables users to find ground truth items by searching text across all multi-turn history",
-      "scopeValidated": true,
-      "researchFile": ".copilot-tracking/subagent/20260122/keyword-search-research.md",
-      "specFile": "specs/keyword-search.md",
-      "status": "complete",
-      "stories": ["SA-828"]
-    },
-    {
-      "name": "tag-filtering",
-      "description": "The tag filtering system allows users to include, exclude, or apply boolean logic to filter items by tags",
-      "scopeValidated": true,
-      "researchFile": ".copilot-tracking/subagent/20260122/tag-filtering-research.md",
-      "specFile": "specs/tag-filtering.md",
-      "status": "complete",
-      "stories": ["SA-363"]
-    },
-    {
-      "name": "explorer-sorting",
-      "description": "The Explorer sorting system handles column sort order, sort direction indicators, and tag-count sorting",
-      "scopeValidated": true,
-      "researchFile": ".copilot-tracking/subagent/20260122/explorer-sorting-research.md",
-      "specFile": "specs/explorer-sorting.md",
-      "status": "complete",
-      "stories": ["SA-684", "SA-361"]
-    }
-  ],
-  "excludedStories": [
-    {
-      "story": "SA-463",
-      "reason": "Bug fix - Explorer layout overflow. Too small for full spec, treat as quick fix."
-    }
-  ],
-  "openQuestions": [],
-  "nextActions": [
-    "Hand off to Planning Mode to generate IMPLEMENTATION_PLAN.md",
-    "Or continue with another JTBD specification"
-  ]
-}
diff --git a/.copilot-tracking/spec-sessions/jtbd-004-data-integrity-security.state.json b/.copilot-tracking/spec-sessions/jtbd-004-data-integrity-security.state.json
deleted file mode 100644
index db79c44..0000000
--- a/.copilot-tracking/spec-sessions/jtbd-004-data-integrity-security.state.json
+++ /dev/null
@@ -1,47 +0,0 @@
-{
-  "jtbdId": "JTBD-004",
-  "jtbdStatement": "Help administrators ensure data integrity and security",
-  "lastAccessed": "2026-01-22T12:00:00Z",
-  "currentPhase": "handoff",
-  "completedPhases": ["jtbd-discovery", "topic-decomposition", "topic-research", "spec-generation"],
-  "topics": [
-    {
-      "name": "pii-detection",
-      "description": "The PII detection system scans ground truth content during import to identify and flag personally identifiable information",
-      "scopeValidated": true,
-      "researchFile": ".copilot-tracking/subagent/20260122/pii-detection-research.md",
-      "specFile": "specs/pii-detection.md",
-      "status": "complete",
-      "stories": ["SA-669"]
-    },
-    {
-      "name": "dos-prevention",
-      "description": "The DoS prevention system enforces batch size limits and rate limiting on bulk import endpoints",
-      "scopeValidated": true,
-      "researchFile": ".copilot-tracking/subagent/20260122/dos-prevention-research.md",
-      "specFile": "specs/dos-prevention.md",
-      "status": "complete",
-      "stories": ["SA-409"]
-    },
-    {
-      "name": "xss-sanitization",
-      "description": "The XSS sanitization system cleanses user-generated content to prevent script injection attacks",
-      "scopeValidated": true,
-      "researchFile": ".copilot-tracking/subagent/20260122/xss-sanitization-research.md",
-      "specFile": "specs/xss-sanitization.md",
-      "status": "complete",
-      "stories": ["SA-565"]
-    },
-    {
-      "name": "batch-validation",
-      "description": "The batch validation system provides detailed error feedback and proper data integrity checks during bulk imports",
-      "scopeValidated": true,
-      "researchFile": ".copilot-tracking/subagent/20260122/batch-validation-research.md",
-      "specFile": "specs/batch-validation.md",
-      "status": "complete",
-      "stories": ["SA-241"]
-    }
-  ],
-  "openQuestions": ["Should SA-565 be updated to reflect URL validation gap instead of textarea XSS?"],
-  "nextActions": ["Ready for handoff to Planning Mode"]
-}
diff --git a/.copilot-tracking/spec-sessions/jtbd-005-code-quality.state.json b/.copilot-tracking/spec-sessions/jtbd-005-code-quality.state.json
deleted file mode 100644
index db6c523..0000000
--- a/.copilot-tracking/spec-sessions/jtbd-005-code-quality.state.json
+++ /dev/null
@@ -1,64 +0,0 @@
-{
-  "jtbdId": "JTBD-005",
-  "jtbdStatement": "Help developers maintain GTC code quality",
-  "lastAccessed": "2026-01-22T12:00:00Z",
-  "currentPhase": "handoff",
-  "completedPhases": [
-    "jtbd-discovery",
-    "topic-decomposition",
-    "topic-research",
-    "spec-generation"
-  ],
-  "stories": [
-    "SA-746",
-    "SA-424",
-    "SA-745",
-    "SA-238",
-    "SA-249",
-    "SA-250",
-    "SA-245"
-  ],
-  "topics": [
-    {
-      "name": "architecture-refactoring",
-      "description": "The architecture refactoring extracts duplicate API logic into services and splits the repository layer into focused modules",
-      "scopeValidated": true,
-      "researchFile": ".copilot-tracking/subagent/20260122/architecture-refactoring-research.md",
-      "specFile": "specs/architecture-refactoring.md",
-      "status": "complete",
-      "stories": ["SA-746", "SA-424"]
-    },
-    {
-      "name": "dependency-injection",
-      "description": "The dependency injection refactoring adopts FastAPI's DI patterns for configuration and services",
-      "scopeValidated": true,
-      "researchFile": ".copilot-tracking/subagent/20260122/dependency-injection-research.md",
-      "specFile": "specs/dependency-injection.md",
-      "status": "complete",
-      "stories": ["SA-238"]
-    },
-    {
-      "name": "ci-code-quality",
-      "description": "The CI code quality enforcement adds linting, formatting, and pre-push hooks with drift reconciliation",
-      "scopeValidated": true,
-      "researchFile": ".copilot-tracking/subagent/20260122/ci-code-quality-research.md",
-      "specFile": "specs/ci-code-quality.md",
-      "status": "complete",
-      "stories": ["SA-745"]
-    },
-    {
-      "name": "code-conventions",
-      "description": "The code conventions standardize Pydantic model usage, exception handling, and logging patterns",
-      "scopeValidated": true,
-      "researchFile": ".copilot-tracking/subagent/20260122/code-conventions-research.md",
-      "specFile": "specs/code-conventions.md",
-      "status": "complete",
-      "stories": ["SA-249", "SA-250", "SA-245"]
-    }
-  ],
-  "openQuestions": [],
-  "nextActions": [
-    "Hand off to Planning Mode to generate IMPLEMENTATION_PLAN.md",
-    "Run task-planner for gap analysis against existing code"
-  ]
-}
diff --git a/.copilot-tracking/spec-sessions/jtbd-006-documentation.state.json b/.copilot-tracking/spec-sessions/jtbd-006-documentation.state.json
deleted file mode 100644
index b3d8757..0000000
--- a/.copilot-tracking/spec-sessions/jtbd-006-documentation.state.json
+++ /dev/null
@@ -1,57 +0,0 @@
-{
-  "jtbdId": "JTBD-006",
-  "jtbdStatement": "Help teams understand GTC through documentation",
-  "stories": ["SA-835", "SA-422", "SA-205"],
-  "lastAccessed": "2026-01-22T12:00:00Z",
-  "currentPhase": "handoff",
-  "completedPhases": [
-    "jtbd-discovery",
-    "topic-decomposition",
-    "topic-research",
-    "spec-generation"
-  ],
-  "topics": [
-    {
-      "name": "docs-infrastructure",
-      "description": "The docs infrastructure provides MkDocs setup with build/serve commands and navigation structure",
-      "scopeValidated": true,
-      "researchFile": ".copilot-tracking/subagent/20260122/docs-infrastructure-research.md",
-      "specFile": "specs/docs-infrastructure.md",
-      "status": "complete"
-    },
-    {
-      "name": "docs-content-strategy",
-      "description": "The content strategy defines audience-specific documentation organization, migration paths, and drift reconciliation workflows",
-      "scopeValidated": true,
-      "researchFile": ".copilot-tracking/subagent/20260122/docs-content-strategy-research.md",
-      "specFile": "specs/docs-content-strategy.md",
-      "status": "complete"
-    },
-    {
-      "name": "tag-glossary",
-      "description": "The tag glossary surfaces tag definitions to users through the UI and allows definitions to be managed",
-      "scopeValidated": true,
-      "researchFile": ".copilot-tracking/subagent/20260122/tag-glossary-research.md",
-      "specFile": "specs/tag-glossary.md",
-      "status": "complete"
-    }
-  ],
-  "decisions": [
-    {
-      "question": "Where should MkDocs be set up?",
-      "answer": "Backend Python environment",
-      "date": "2026-01-22"
-    },
-    {
-      "question": "How should tag glossary definitions be stored?",
-      "answer": "Hybrid: config for system tags, database for SME-created tags",
-      "date": "2026-01-22"
-    }
-  ],
-  "openQuestions": [],
-  "nextActions": [
-    "Hand off to Planning Mode to generate IMPLEMENTATION_PLAN.md",
-    "Planning Mode performs gap analysis against existing code",
-    "Building Mode implements tasks from plan"
-  ]
-}
diff --git a/.copilot-tracking/spec-sessions/jtbd-007-chunked-references.state.json b/.copilot-tracking/spec-sessions/jtbd-007-chunked-references.state.json
deleted file mode 100644
index 01fa35f..0000000
--- a/.copilot-tracking/spec-sessions/jtbd-007-chunked-references.state.json
+++ /dev/null
@@ -1,27 +0,0 @@
-{
-  "jtbdId": "JTBD-007",
-  "jtbdStatement": "Help GTC handle chunked document references correctly",
-  "lastAccessed": "2026-01-22T12:00:00Z",
-  "currentPhase": "spec-generation",
-  "completedPhases": ["jtbd-discovery", "topic-decomposition", "topic-research", "spec-generation"],
-  "topics": [
-    {
-      "name": "reference-identity",
-      "description": "The reference identity system uses chunk ID from the search index as the primary uniqueness key instead of URL",
-      "scopeValidated": true,
-      "researchFile": ".copilot-tracking/subagent/20260122/reference-identity-research.md",
-      "specFile": "specs/reference-identity.md",
-      "status": "specified"
-    }
-  ],
-  "openQuestions": [],
-  "resolvedQuestions": [
-    "UI should display chunk ID and allow users to view chunk text content",
-    "Identity key is chunk ID alone; references are stored per-turn in history[].refs[]"
-  ],
-  "stories": ["SA-821", "SA-257"],
-  "excludedStories": {
-    "SA-447": "Moved to separate JTBD for export/split-tags"
-  },
-  "nextActions": ["Hand off to Planning Mode to generate IMPLEMENTATION_PLAN.md"]
-}
diff --git a/.copilot-tracking/spec-sessions/jtbd-008-cosmos-performance.state.json b/.copilot-tracking/spec-sessions/jtbd-008-cosmos-performance.state.json
deleted file mode 100644
index 15f72d8..0000000
--- a/.copilot-tracking/spec-sessions/jtbd-008-cosmos-performance.state.json
+++ /dev/null
@@ -1,49 +0,0 @@
-{
-  "jtbdId": "JTBD-008",
-  "jtbdStatement": "Help optimize GTC performance and Cosmos usage",
-  "lastAccessed": "2026-01-22T12:30:00Z",
-  "currentPhase": "spec-generation",
-  "completedPhases": ["jtbd-discovery", "topic-decomposition", "topic-research", "spec-generation"],
-  "topics": [
-    {
-      "name": "cosmos-indexing",
-      "description": "The indexing strategy limits indexed fields to reduce write RU costs",
-      "scopeValidated": true,
-      "researchFile": ".copilot-tracking/subagent/20260122/cosmos-indexing-research.md",
-      "specFile": "specs/cosmos-indexing.md",
-      "status": "specified",
-      "stories": ["SA-242"]
-    },
-    {
-      "name": "partial-updates",
-      "description": "The partial update system patches only changed fields instead of replacing entire documents",
-      "scopeValidated": true,
-      "researchFile": ".copilot-tracking/subagent/20260122/partial-updates-research.md",
-      "specFile": "specs/partial-updates.md",
-      "status": "specified",
-      "stories": ["SA-244"]
-    },
-    {
-      "name": "query-optimization",
-      "description": "The query optimization effort replaces expensive cross-partition queries with efficient patterns",
-      "scopeValidated": true,
-      "researchFile": ".copilot-tracking/subagent/20260122/query-optimization-research.md",
-      "specFile": "specs/query-optimization.md",
-      "status": "specified",
-      "stories": ["SA-247", "SA-248"]
-    },
-    {
-      "name": "concurrency-control",
-      "description": "The concurrency control mechanism prevents race conditions during simultaneous updates",
-      "scopeValidated": true,
-      "researchFile": ".copilot-tracking/subagent/20260122/concurrency-control-research.md",
-      "specFile": "specs/concurrency-control.md",
-      "status": "specified",
-      "stories": ["SA-246"]
-    }
-  ],
-  "openQuestions": [],
-  "nextActions": [
-    "Hand off to Planning Mode for implementation plan generation"
-  ]
-}
diff --git a/.copilot-tracking/subagent/20260121/api-logic-research.md b/.copilot-tracking/subagent/20260121/api-logic-research.md
deleted file mode 100644
index 8fd05fc..0000000
--- a/.copilot-tracking/subagent/20260121/api-logic-research.md
+++ /dev/null
@@ -1,224 +0,0 @@
----
-title: API logic research
-description: Candidates for moving business logic out of FastAPI handlers into service-layer modules
-author: GitHub Copilot
-ms.date: 2026-01-21
-ms.topic: reference
-keywords:
-  - fastapi
-  - service layer
-  - refactor
-  - concurrency
-  - etag
-  - tags
-estimated_reading_time: 8
----
-
-## Goal
-
-Identify backend API endpoints that contain business logic beyond orchestration, and map that logic to service-layer boundaries.
-
-## Summary of findings
-
-* Several handlers in `backend/app/api/v1/*` perform domain workflows directly against `container.repo`.
-* The heaviest duplication centers on:
-  * Partial-update semantics across multiple fields
-  * ETag enforcement and error mapping
-  * Tag constraints and computed-tag recomputation
-  * History parsing (including references embedded in history)
-* Services already exist for several chunks of this logic (`AssignmentService`, `TaggingService`, `ValidationService`, `TagRegistryService`, `SnapshotService`, `ChatService`), but handlers still own cross-cutting workflow steps.
-
-## Service boundary guidance
-
-* API layer responsibilities
-  * Authenticate and authorize
-  * Parse inputs, perform lightweight request-shape validation
-  * Translate service errors to HTTP status codes
-* Service layer responsibilities
-  * Domain workflows and state transitions
-  * Concurrency rules (ETag requirements) and retryable failures
-  * Tag normalization, manual tag constraints, computed-tag recomputation
-  * Shared parsing/normalization of payload fields that appear across endpoints
-
-## Endpoint candidates
-
-### 1) SME assignment update workflow
-
-File: `backend/app/api/v1/assignments.py`
-
-Excerpt: [backend/app/api/v1/assignments.py#L72-L255](backend/app/api/v1/assignments.py#L72-L255)
-
-What is happening in the handler:
-
-* Joins multiple concerns:
-  * Ownership enforcement (`assignedTo` must match caller)
-  * Partial-update semantics driven by `model_fields_set`
-  * Approval/status transitions that clear assignment and set review metadata
-  * Parsing and validating `history` with embedded `refs`
-  * ETag enforcement via `If-Match` or body `etag` and mapping mismatch to HTTP 412
-  * Computed tag application before persisting
-  * Best-effort deletion of the assignment document after completion
-
-Service extraction candidates:
-
-* Move domain workflow into `AssignmentService` or a new `GroundTruthUpdateService`
-  * `update_assigned_item(dataset: str, bucket: UUID, item_id: str, user_id: str, update: AssignmentUpdateRequest, if_match: str | None) -> GroundTruthItem`
-  * Keep the API handler responsible for request parsing only
-* Extract shared helpers for use by both assignments and ground-truth CRUD:
-  * `parse_history(payload_history: list[dict[str, Any]] | None) -> list[HistoryItem] | None`
-  * `require_etag(if_match: str | None, body_etag: str | None) -> str`
-
-Notes on existing services:
-
-* `apply_computed_tags` already exists in `app/services/tagging_service.py` and is called here from the handler
-* The handler still owns the workflow steps and error mapping that are likely to be repeated elsewhere
-
-### 2) Single-item assignment orchestration
-
-File: `backend/app/api/v1/assignments.py`
-
-Excerpt: [backend/app/api/v1/assignments.py#L257-L323](backend/app/api/v1/assignments.py#L257-L323)
-
-What is happening in the handler:
-
-* Delegates assignment to `container.assignment_service.assign_single_item`
-* Contains error-translation logic that likely belongs in a consistent, shared error mapper:
-  * Converts different `ValueError` message substrings to 404 vs 409 vs 400
-
-Service extraction candidates:
-
-* Keep `AssignmentService.assign_single_item` as-is, but standardize errors:
-  * Prefer typed exceptions (e.g., `NotFoundError`, `AlreadyAssignedError`, `InvalidStateError`) so HTTP mapping is stable and not substring-based
-
-### 3) Bulk import workflow
-
-File: `backend/app/api/v1/ground_truths.py`
-
-Excerpt: [backend/app/api/v1/ground_truths.py#L54-L127](backend/app/api/v1/ground_truths.py#L54-L127)
-
-What is happening in the handler:
-
-* Implements a full workflow, not just orchestration:
-  * Generates IDs for missing items using `randomname`
-  * Validates all items via `validate_bulk_items` and filters invalid items
-  * Optionally applies approval metadata for all surviving items
-  * Applies computed tags for each item (fetches registry once)
-  * Persists through `container.repo.import_bulk_gt`
-
-Service extraction candidates:
-
-* Move into a dedicated import service (or a `GroundTruthService`):
-  * `import_bulk(items: list[GroundTruthItem], *, buckets: int | None, approve: bool, user_id: str | None) -> ImportBulkResponse`
-* Explicitly separate concerns:
-  * ID generation and order preservation
-  * Validation and error aggregation
-  * Approval metadata policy
-  * Tag recomputation policy
-
-Notes on existing services:
-
-* `validate_bulk_items` is in `app/services/validation_service.py`
-* Computed-tag logic is in `app/services/tagging_service.py`
-* The handler currently coordinates all these pieces and should become a thin wrapper
-
-### 4) Ground-truth list query validation
-
-File: `backend/app/api/v1/ground_truths.py`
-
-Excerpt: [backend/app/api/v1/ground_truths.py#L160-L252](backend/app/api/v1/ground_truths.py#L160-L252)
-
-What is happening in the handler:
-
-* Implements query normalization and validation rules:
-  * Coerces `status` string into `GroundTruthStatus`
-  * Validates `limit` and `page`
-  * Trims `itemId` and `refUrl`, enforces max lengths
-  * Parses comma-separated `tags` with max tag count and max length
-
-Service extraction candidates:
-
-* Keep low-level validation here if it stays purely request-level, but consider extracting for reuse:
-  * `normalize_list_query(status: str | None, item_id: str | None, ref_url: str | None, tags: str | None, page: int, limit: int) -> NormalizedQuery`
-
-### 5) Ground-truth update workflow
-
-File: `backend/app/api/v1/ground_truths.py`
-
-Excerpt: [backend/app/api/v1/ground_truths.py#L283-L394](backend/app/api/v1/ground_truths.py#L283-L394)
-
-What is happening in the handler:
-
-* Repeats many of the same concerns as the assignments update endpoint:
-  * Partial updates across multiple fields (including status coercion)
-  * Reference parsing from list payloads
-  * Explicit rejection of `computedTags` and legacy `tags` (business rule)
-  * Manual tag update, mapped through domain validation
-  * History parsing, including parsing `refs` and `expectedBehavior`
-  * ETag requirement and mismatch mapping to HTTP 412
-  * Computed tag application before persisting
-  * Re-fetching to return latest ETag
-
-Service extraction candidates:
-
-* Create a shared update service used by both SME and admin-like updates:
-  * `update_item(dataset: str, bucket: UUID, item_id: str, payload: dict[str, Any], *, if_match: str | None, user_id: str | None) -> GroundTruthItem`
-* Consolidate shared parsing/validation helpers with the SME update handler:
-  * History parsing and reference parsing
-  * ETag policy enforcement and mismatch translation
-  * Tag-field acceptance policy (manual-only)
-
-### 6) Bulk recompute computed tags
-
-File: `backend/app/api/v1/ground_truths.py`
-
-Excerpt: [backend/app/api/v1/ground_truths.py#L408-L504](backend/app/api/v1/ground_truths.py#L408-L504)
-
-What is happening in the handler:
-
-* Implements a batch domain workflow:
-  * Fetches items based on filter criteria
-  * Applies computed tags for each item and diffs tag sets
-  * When tags change and `dry_run=false`, bypasses the ETag check and upserts the item
-  * Aggregates errors and logs a summary
-
-Service extraction candidates:
-
-* Move into `TaggingService` or a dedicated maintenance service:
-  * `recompute_computed_tags(*, dataset: str | None, status: GroundTruthStatus | None, dry_run: bool) -> RecomputeTagsResponse`
-* Centralize the “bypass ETag for maintenance” rule in one place
-
-## Additional candidates
-
-### Chat endpoint input and error policy
-
-File: `backend/app/api/v1/chat.py`
-
-Excerpt: [backend/app/api/v1/chat.py#L29-L158](backend/app/api/v1/chat.py#L29-L158)
-
-Notes:
-
-* Message sanitization and validation are largely request-layer concerns.
-* The handler owns error-to-status mapping and correlation ID propagation. If this pattern repeats, it could be centralized (for example, in a shared exception-to-response utility), but it is not urgent compared to the GT/assignment workflows.
-
-### Tags endpoint config precedence
-
-File: `backend/app/api/v1/tags.py`
-
-Excerpt: [backend/app/api/v1/tags.py#L66-L106](backend/app/api/v1/tags.py#L66-L106)
-
-Notes:
-
-* The handler determines the source of truth for manual tags based on `settings.ALLOWED_MANUAL_TAGS` vs persisted registry.
-* This is a domain/config decision and is a good candidate for `TagRegistryService`:
-  * `list_manual_tags_with_computed_filtered() -> tuple[list[str], list[str]]`
-
-## Container and DI observations
-
-* API handlers frequently depend on the global `container` singleton and call `container.repo.*` directly.
-* When extracting services, prefer constructor-injected dependencies (repo protocols, registry providers) to reduce implicit coupling and make unit testing easier.
-
-## Suggested next steps
-
-* Extract shared “ground truth update” workflow into a single service method used by both assignments and ground-truth CRUD.
-* Replace substring-based error mapping with typed domain exceptions to stabilize HTTP status codes.
-* Keep handler functions thin: authentication, request parsing, and response formatting only.
diff --git a/.copilot-tracking/subagent/20260121/backend-requirements-research.md b/.copilot-tracking/subagent/20260121/backend-requirements-research.md
deleted file mode 100644
index 972a191..0000000
--- a/.copilot-tracking/subagent/20260121/backend-requirements-research.md
+++ /dev/null
@@ -1,210 +0,0 @@
-# Backend Behavioral Requirements (Doc-Inferred)
-
-Date: 2026-01-21
-
-This document captures stable, high-level backend behavioral requirements inferred from backend documentation. It is intended to describe “what the backend must do” in a testable way, not propose new features.
-
-## Scope and sources
-
-Primary sources reviewed:
-
-- backend/README.md
-- backend/CODEBASE.md
-- backend/docs/api-change-checklist-assignments.md
-- backend/docs/assign-single-item-endpoint.md
-- backend/docs/api-write-consolidation-plan.v2.md
-- backend/docs/drift_cleanup.md
-- backend/docs/tagging_plan.md
-- backend/docs/export-pipeline.md
-- backend/docs/multi-turn-refs.md
-- backend/docs/history-tags-feature.md
-- backend/docs/cosmos-emulator-limitations.md
-
-## Requirements (by area)
-
-### API wire conventions
-
-- The API accepts both snake_case and camelCase inputs, but responses are always camelCase (aliases) via Pydantic.
-	- Evidence (backend/CODEBASE.md#L32-L34):
-		> - Pydantic v2 with aliases: accept snake_case or camelCase on input; always output camelCase via model_dump(..., by_alias=True).
-
-### Health check
-
-- The service exposes a health endpoint at `GET /healthz`.
-	- Evidence (backend/CODEBASE.md#L14-L15, backend/CODEBASE.md#L159-L161):
-		> - GET /healthz returns repo/backend info (Cosmos details when active).
-		> - Health check: GET /healthz
-
-### Optimistic concurrency (ETag / If-Match)
-
-- Updates must enforce optimistic concurrency using Cosmos ETags.
-	- Clients may supply the ETag via the `If-Match` header or an `etag` field in the request body (depending on the endpoint); a missing or mismatched ETag maps to HTTP 412.
-	- Evidence (backend/CODEBASE.md#L32-L34):
-		> - Concurrency uses ETag: updates require If-Match header or etag in body; 412 on missing/mismatch.
-
-- Assignment write paths must require `If-Match` and return/echo the updated ETag.
-	- Evidence (backend/docs/api-change-checklist-assignments.md#L7-L12, backend/docs/api-change-checklist-assignments.md#L74-L90):
-		> - Require `If-Match` on all write paths (approve/skip/delete) and return updated ETag.
-		> - Request headers (required):
-		>   - `If-Match: <etag>` (all write paths)
-		> - 412 Precondition Failed: Missing/invalid ETag. Error code `IF_MATCH_REQUIRED` or `ETAG_MISMATCH`. Include current ETag in `ETag` header.
-	- Evidence (backend/docs/api-change-checklist-assignments.md#L168-L178):
-		> - Concurrency: All writes require `If-Match` with the current ETag. On mismatch, return 412 and provide the current ETag in the `ETag` header.
-		> - ETag: Return the new ETag in the 200 response body and `ETag` header.
-
-### Delete semantics (soft delete)
-
-- “Delete” is represented as `status=deleted` (soft delete), and list APIs filter deleted items unless status is explicitly requested.
-	- Evidence (backend/CODEBASE.md#L33-L34):
-		> - Soft-delete via status=deleted; list APIs filter unless status is specified.
-
-### Write surface area (consolidation)
-
-- Ground Truth item writes are consolidated to two update endpoints: SME PUT and Curator PUT.
-	- Evidence (backend/docs/api-write-consolidation-plan.v2.md#L28-L36, backend/docs/api-write-consolidation-plan.v2.md#L60-L67):
-		> - SME PUT `/v1/assignments/{item_id}`
-		>   - Add: optional `etag` in body, and accept `If-Match` header
-		> - Curator PUT `/v1/ground-truths/{datasetName}/{item_id}`
-		>   - Add: optional `etag` in body, and accept `If-Match` header
-		> - Only two endpoints perform writes to GT items: SME PUT and Curator PUT.
-
-- Curator import remains a separate POST and is create-only (no updates).
-	- Evidence (backend/docs/api-write-consolidation-plan.v2.md#L38-L45, backend/docs/api-write-consolidation-plan.v2.md#L62-L64):
-		> - Curator POST `/v1/ground-truths` (import)
-		>   - Unchanged in path/method; clarify it’s for create/import only (no updates)
-		> - Curator POST import remains for create-only flows.
-
-- Reference subroutes are removed; references are handled via the PUTs.
-	- Evidence (backend/docs/api-write-consolidation-plan.v2.md#L21-L25, backend/docs/api-write-consolidation-plan.v2.md#L64-L66):
-		> | POST   | /v1/ground-truths/{datasetName}/{item_id}/references          | Curator | Add references to item      | Remove | Fold into Curator PUT with references           |
-		> | DELETE | /v1/ground-truths/{datasetName}/{item_id}/references/{ref_id} | Curator | Remove a specific reference | Remove | Fold into Curator/SME PUT via references        |
-		> - Reference-specific endpoints are removed and covered by references in PUTs.
-
-### Assignments
-
-- Ownership must be enforced on SME mutation routes; non-owner attempts return 403 with stable error code.
-	- Evidence (backend/docs/api-change-checklist-assignments.md#L7-L12, backend/docs/api-change-checklist-assignments.md#L82-L86):
-		> - Enforce ownership on SME update route with 403 and stable error code.
-		> - 403 Forbidden: Ownership violation. Error code `ASSIGNMENT_OWNERSHIP`.
-	- Evidence (backend/docs/api-change-checklist-assignments.md#L168-L176):
-		> - Ownership: Only the currently assigned user may mutate. If unassigned or assigned to a different user, return 403/`ASSIGNMENT_OWNERSHIP`.
-
-- On transitions to skipped/approved/deleted, assignment fields must be cleared atomically (same write).
-	- Evidence (backend/docs/api-change-checklist-assignments.md#L10-L12, backend/docs/api-change-checklist-assignments.md#L175-L178):
-		> - Clear assignment fields atomically on transitions (skipped/approved/deleted).
-		> - Assignment clearing: On transitions to skipped/approved/deleted, clear `assignedTo` and `assignedAt` atomically with the status change.
-
-- Assignment timestamps should be timezone-aware UTC (RFC3339), set via `datetime.now(timezone.utc)`.
-	- Evidence (backend/docs/api-change-checklist-assignments.md#L13-L14, backend/docs/api-change-checklist-assignments.md#L182-L184):
-		> - Use timezone-aware UTC timestamps via `datetime.now(timezone.utc)` when setting or updating `assignedAt` or other timestamps.
-		> - `assignedAt` (nullable, RFC3339 UTC). Set with `datetime.now(timezone.utc)`.
-
-- `/v1/assignments/my` response must include `etag` in the JSON body (headers optional).
-	- Evidence (backend/docs/api-change-checklist-assignments.md#L19-L24, backend/docs/api-change-checklist-assignments.md#L35-L38):
-		> - `etag` (string)
-		> - `assignedAt` (string, RFC3339 UTC)
-		> - `ETag` header is optional per-item, but the item’s `etag` MUST be included in the JSON body.
-
-- The single-item assignment endpoint (`POST /v1/assignments/{dataset}/{bucket}/{item_id}/assign`) must return 409 Conflict when the item is draft-assigned to another user and, on success, must always set the status to draft.
-	- Evidence (backend/docs/assign-single-item-endpoint.md#L22-L29, backend/docs/assign-single-item-endpoint.md#L33-L43):
-		> - **409 Conflict**: Item is already assigned to another user in draft state
-		> 2. **Items assigned to another user (draft status)**: Cannot be assigned (409 Conflict) ❌
-		> **Important**: When an item is assigned, its status is **always set to draft**, regardless of previous state (approved, deleted, skipped, etc.).
-
-- Successful assignment must create/upsert a secondary “assignment document” in the assignments container (materialized view) for fast per-user queries.
-	- Evidence (backend/docs/assign-single-item-endpoint.md#L85-L95):
-		> When an item is successfully assigned, an assignment document is created in the assignments container with:
-		> ...
-		> This materialized view allows fast retrieval of all items assigned to a user via `/v1/assignments/my`.
-
-### Tagging
-
-- Tags must be stored and returned in canonical `group:value` format (lowercase), with normalization (trim/lowercase/dedupe/sort) for deterministic output.
-	- Evidence (backend/docs/tagging_plan.md#L3-L10):
-		> - Canonical form is `group:value` (all lowercase). Inputs are normalized (trimmed, lowercased, deduplicated, sorted for determinism).
-	- Evidence (backend/docs/tagging_plan.md#L54-L60):
-		> - Lowercase group and value; trim whitespace; collapse inner whitespace; accept and normalize `group : value` to `group:value`.
-		> - Deduplicate after normalization; sort ascending for deterministic storage.
-
-- Unknown groups/values are allowed, but known-group behavioral rules (e.g., exclusivity) must be enforced.
-	- Evidence (backend/docs/tagging_plan.md#L5-L10, backend/docs/tagging_plan.md#L60-L66):
-		> - Unknown groups and values are allowed. We do not enforce membership in a hardcoded set.
-		> - For known groups defined in our schema, we still enforce behavioral rules like mutual exclusivity.
-		> - Unknown groups or values are allowed. We only enforce format and known-group rules.
-		> - Exclusive groups (in the known schema) may contain at most one value.
-
-### Snapshot export pipeline
-
-- Snapshot export supports `attachment` and `artifact` delivery modes; missing/empty request body uses defaults equivalent to `attachment`.
-	- Evidence (backend/docs/export-pipeline.md#L26-L34):
-		> Supports `attachment` or `artifact` delivery.
-		> If the request body is omitted or `{}`, the server uses defaults (equivalent to `delivery.mode=attachment`).
-
-- `GET /v1/ground-truths/snapshot` always returns a JSON document payload (not artifacts).
-	- Evidence (backend/docs/export-pipeline.md#L33-L38):
-		> * Always returns a JSON document payload (not storage artifacts)
-
-- Artifact exports must write one JSON file per item plus a manifest under a deterministic path, and the manifest includes `schemaVersion` currently `v2`.
-	- Evidence (backend/docs/export-pipeline.md#L76-L90):
-		> Artifacts are written under:
-		> * `exports/snapshots/{snapshotAt}/ground-truth-{id}.json`
-		> * `exports/snapshots/{snapshotAt}/manifest.json`
-		> ...
-		> * `schemaVersion` (currently `v2`)
-
-- Export processors run before formatting; `merge_tags` merges manual/computed tags into a single sorted union `tags` array.
-	- Evidence (backend/docs/export-pipeline.md#L116-L128):
-		> ### `merge_tags`
-		> Merges tag fields into a single `tags` array on each exported document:
-		> * Reads `manualTags`/`manual_tags` and `computedTags`/`computed_tags`
-		> * Writes `tags` as a sorted union of the two
-
-### Multi-turn history: refs + per-turn tags
-
-- The backend must remain backward compatible with top-level `refs` while supporting optional per-history-item `refs` for assistant messages.
-	- Evidence (backend/docs/multi-turn-refs.md#L5-L8, backend/docs/multi-turn-refs.md#L67-L73):
-		> This change maintains backward compatibility with the existing top-level `refs` field.
-		> 1. **Top-level `refs` field preserved**: The `GroundTruthItem.refs` field at the top level remains unchanged and continues to work as before.
-		> 2. **Optional refs in history**: The `refs` field in `HistoryItem` is optional (defaults to `None`), so existing history items without refs continue to work.
-
-- History item `tags` is optional and defaults to an empty list; when parsing, accept both `msg` and `content` field names.
-	- Evidence (backend/docs/multi-turn-refs.md#L71-L76):
-		> 3. **Optional tags in history**: The `tags` field in `HistoryItem` is optional (defaults to an empty list), so existing history items without tags continue to work.
-		> 4. **Flexible field names**: The parser supports both `msg` and `content` field names for the message text, accommodating different client implementations.
-
-- Tag validation for history items is intentionally permissive: a list of strings with no value-format restrictions, and duplicates are allowed.
-	- Evidence (backend/docs/history-tags-feature.md#L140-L150):
-		> - Tags must be a list of strings (enforced by Pydantic)
-		> - No format restrictions on individual tag values
-		> - Empty lists are allowed
-		> - Duplicate tags are allowed (no automatic deduplication at model level)
-
-### Observability and user identity
-
-- Logs must include a `user=<id>` field derived per request; in dev mode it comes from `X-User-Id` header (else `anonymous`).
-	- Evidence (backend/README.md#L334-L341):
-		> Every log line now includes a `user=<id>` field derived per request:
-		> - Dev mode (Easy Auth disabled): uses the `X-User-Id` header if provided, otherwise `anonymous`.
-		> - Tests can set `X-User-Id` to simulate multiple users.
-
-### Local dev and emulator constraints
-
-- When using Cosmos Emulator and multiturn data containing Unicode, the backend must support disabling unicode escaping to avoid emulator parsing bugs.
-	- Evidence (backend/README.md#L104-L121):
-		> - `GTC_COSMOS_DISABLE_UNICODE_ESCAPE=true` (workaround for emulator Unicode bug with multiturn data)
-		> **Solution:** Set `GTC_COSMOS_DISABLE_UNICODE_ESCAPE=true` ... ensures that the backend sends real UTF-8 characters instead of escape sequences...
-
-- The emulator does not support `ARRAY_CONTAINS`, so tag-filtering queries against it cannot rely on server-side `ARRAY_CONTAINS` behavior.
-	- Evidence (backend/docs/cosmos-emulator-limitations.md#L5-L18, backend/README.md#L248-L259):
-		> **Issue:** The Cosmos DB Emulator does not support the `ARRAY_CONTAINS` SQL function...
-		> ...
-		> Integration tests that test tag filtering functionality must be skipped when using the emulator
-		> ...
-		> **Note:** Some tests are skipped when using the Cosmos DB Emulator due to unsupported features (e.g., `ARRAY_CONTAINS` for tag filtering).
-
-## Notes / interpretation boundaries
-
-- Some docs (e.g., drift cleanup) describe the intended “design compliance” direction rather than fully enforced current behavior; items listed under “Goal/Acceptance criteria” are treated here as target requirements.
-	- Evidence (backend/docs/drift_cleanup.md#L7-L18):
-		> Goal: align current FastAPI endpoints so all Ground Truth writes happen only via SME PUT and Curator PUT...
-		> ... ETag-based concurrency enforced, and reference-specific routes removed.
\ No newline at end of file
diff --git a/.copilot-tracking/subagent/20260121/citation-validation.md b/.copilot-tracking/subagent/20260121/citation-validation.md
deleted file mode 100644
index 2463083..0000000
--- a/.copilot-tracking/subagent/20260121/citation-validation.md
+++ /dev/null
@@ -1,41 +0,0 @@
-<!-- markdownlint-disable-file -->
-# Citation Validation Report (2026-01-21)
-
-## Status
-
-Complete.
-
-- Target document: [.copilot-tracking/research/20260121-high-level-requirements-research.md](../research/20260121-high-level-requirements-research.md)
-- Target document: [.copilot-tracking/research/20260121-high-level-requirements-research.md](../../research/20260121-high-level-requirements-research.md)
-- Validation scope: all Markdown links in the form `[label](path#Lx)` or `[label](path#Lx-Ly)`
-- Run date: 2026-01-21
-
-## Key Findings
-
-- Total citations found: 86
-- Unique citations (deduplicated by `(path, startLine, endLine)`): 69
-- Broken citations: 0
-  - Missing file: 0
-  - Line range beyond EOF: 0
-  - Invalid line numbers (e.g., < 1 or reversed): 0
-
-Notes:
-
-- The document contains repeated citations (expected, because the same source supports multiple requirements).
-- Several citations intentionally use single-line anchors (e.g., `#L79-L79`). These are valid and within file bounds.
-
-## Fix List (Corrected Line Ranges)
-
-No corrections were required.
-
-- Count: 0
-- Document changes applied: none
-
-## Validation Method
-
-- Parsed the target Markdown and extracted all citations matching `...](<path>#L<start>(-L<end>)?)`.
-- For each unique citation:
-  - Verified the target file exists at the repo-relative path.
-  - Counted file lines and validated `1 <= start <= end <= lineCount`.
-
-A stronger “semantic” validation pass (confirming that the referenced lines actually contain the claimed behavior) was not performed; spot-checking the highest-priority sections and tightening ranges where appropriate remains a possible follow-up.
diff --git a/.copilot-tracking/subagent/20260121/consolidated-requirements-synthesis.md b/.copilot-tracking/subagent/20260121/consolidated-requirements-synthesis.md
deleted file mode 100644
index 09d8b48..0000000
--- a/.copilot-tracking/subagent/20260121/consolidated-requirements-synthesis.md
+++ /dev/null
@@ -1,298 +0,0 @@
----
-title: Consolidated requirements synthesis
-description: Consolidated high-level requirements derived from subagent research reports, with primary-source evidence and identified ambiguities.
-author: GitHub Copilot (subagent)
-ms.date: 2026-01-21
-ms.topic: reference
-keywords:
-  - requirements
-  - synthesis
-  - ground truth curation
-  - backend
-  - frontend
-estimated_reading_time: 15
----
-
-## Scope
-
-This document consolidates high-level requirements from prior subagent research reports into a single, testable requirements set.
-
-Notes on inputs:
-
-* Two requested inputs were not found at the expected paths:
-  * `.copilot-tracking/subagent/20260121/conventions-and-sources-research.md`
-  * `.copilot-tracking/subagent/20260121/prd-requirements-research.md`
-* Closest available subagent sources used instead:
-  * `.copilot-tracking/subagent/20260121/conventions-research.md`
-  * `.copilot-tracking/subagent/20260121/backend-requirements-research.md`
-  * `.copilot-tracking/subagent/20260121/frontend-requirements-research.md`
-  * `.copilot-tracking/subagent/20260121/cosmos-repo-research.md` (constraints only)
-  * `.copilot-tracking/subagent/20260121/api-logic-research.md` and `.copilot-tracking/subagent/20260121/synthesis-notes.md` (constraints only)
-
-All requirements below include at least one primary source reference (repo file path plus line range) as cited by the subagent reports.
-
-## Top 10 requirements
-
-1. The system must support an assignment-based curation workflow where users work primarily from an assigned-items queue.
-  Evidence: frontend/CODEBASE.md#L128-L148; backend/docs/assign-single-item-endpoint.md#L85-L95.
-1. The backend must enforce optimistic concurrency on writes using Cosmos DB ETags, requiring `If-Match` and returning updated ETags.
-  Evidence: backend/docs/api-change-checklist-assignments.md#L74-L90; backend/docs/api-change-checklist-assignments.md#L168-L178.
-1. The UI must gate approval based on reference completeness (selection, visited state, and minimum key-paragraph length).
-  Evidence: frontend/CODEBASE.md#L75-L79; frontend/src/components/app/defaultCurateInstructions.md#L1-L4.
-1. The backend must implement soft delete via status transitions and exclude deleted items from lists unless explicitly requested.
-  Evidence: backend/CODEBASE.md#L33-L34; frontend/CODEBASE.md#L145-L147.
-1. References must support search-and-add and selected-reference management, including de-duplication by URL and visited tracking.
-  Evidence: frontend/CODEBASE.md#L136-L144.
-1. The system must support multi-turn conversation editing with per-turn metadata and additional approval constraints.
-  Evidence: frontend/IMPLEMENTATION_SUMMARY.md#L88-L112; backend/docs/multi-turn-refs.md#L67-L76.
-1. Snapshot export must support attachment delivery and artifact delivery with a manifest and stable schema versioning.
-  Evidence: backend/docs/export-pipeline.md#L26-L34; backend/docs/export-pipeline.md#L76-L90.
-1. Tag storage and behavior must normalize tags into a canonical `group:value` format and enforce known-group behavioral rules.
-  Evidence: backend/docs/tagging_plan.md#L3-L10; backend/docs/tagging_plan.md#L60-L66.
-1. The backend must be usable with the Cosmos DB Emulator for local development, with documented emulator limitations handled safely.
-  Evidence: backend/docs/cosmos-emulator-limitations.md#L5-L27; backend/app/main.py#L56-L85.
-1. Telemetry must be opt-in and safe-by-default, and the UI must present a user-friendly error boundary.
-  Evidence: frontend/docs/OBSERVABILITY_IMPLEMENTATION.md#L13-L18; frontend/docs/OBSERVABILITY_IMPLEMENTATION.md#L79-L86.
-
-## Product goals
-
-* Enable curators and SMEs to curate ground-truth items efficiently using an assignment-based workflow with a focused curation workspace.
-  * Evidence:
-    * frontend/CODEBASE.md#L70-L80.
-    * frontend/CODEBASE.md#L128-L148.
-* Support both single-turn (Q/A) and multi-turn (conversation history) ground-truth formats.
-  * Evidence:
-    * frontend/IMPLEMENTATION_SUMMARY.md#L88-L112.
-    * backend/docs/multi-turn-refs.md#L67-L76.
-* Preserve backward compatibility for existing stored item shapes while introducing multi-turn enhancements.
-  * Evidence:
-    * backend/docs/multi-turn-refs.md#L67-L73.
-
-## Frontend UX
-
-* Provide a single-page curation workspace with a multi-pane layout (queue, editor/actions, references, stats/modals).
-  * Evidence:
-    * frontend/CODEBASE.md#L70-L80.
-* Provide an assigned-items queue that supports selection, refresh, and visibility of key item attributes.
-  * Evidence:
-    * frontend/CODEBASE.md#L128-L148.
-* Provide a self-serve assignment action with a configurable default limit.
-  * Evidence:
-    * frontend/README.md#L20-L44.
-    * frontend/CODEBASE.md#L128-L148.
-* Enable editing of item content; saving must not be blocked by a “change category” requirement.
-  * Evidence:
-    * frontend/CODEBASE.md#L86-L95.
-* Provide references UX with two experiences: search candidates and manage selected/attached references.
-  * Evidence:
-    * frontend/CODEBASE.md#L136-L144.
-* Prevent duplicate reference additions by URL, including disabling add when URL is already present.
-  * Evidence:
-    * frontend/CODEBASE.md#L136-L144.
-* Support opening references in a new tab and marking visited state, and provide user feedback when popups are blocked.
-  * Evidence:
-    * frontend/CODEBASE.md#L136-L144.
-    * frontend/CODEBASE.md#L215-L233.
-* Allow capturing a key paragraph per selected reference and display a length/counter affordance.
-  * Evidence:
-    * frontend/CODEBASE.md#L136-L144.
-* Support removing a reference with an undo window.
-  * Evidence:
-    * frontend/CODEBASE.md#L136-L144.
-    * frontend/CODEBASE.md#L215-L233.
-* Gate approval based on reference completeness and item state (deleted items cannot be approved).
-  * Evidence:
-    * frontend/CODEBASE.md#L75-L79.
-    * frontend/CODEBASE.md#L145-L147.
-* Provide save semantics that detect no-op updates and communicate “No changes”.
-  * Evidence:
-    * frontend/CODEBASE.md#L140-L146.
-* Support soft-delete and restore workflows with clear UI indicators and approval gating.
-  * Evidence:
-    * frontend/CODEBASE.md#L145-L147.
-* Support export as a backend-driven snapshot download.
-  * Evidence:
-    * frontend/CODEBASE.md#L145-L146.
-* Support applying tags to an item using a known tag schema.
-  * Evidence:
-    * frontend/docs/MVP_REQUIREMENTS.md#L22-L27.
-* Surface curation instructions as user-consumable markdown and support fetch/write per dataset with concurrency control.
-  * Evidence:
-    * frontend/docs/MVP_REQUIREMENTS.md#L15-L18.
-    * frontend/CODEBASE.md#L165-L168.
-* Support multi-turn conversation editing with a timeline view, turn operations, and optional context.
-  * Evidence:
-    * frontend/IMPLEMENTATION_SUMMARY.md#L88-L112.
-* Enforce multi-turn approval constraints beyond single-turn, including relevance marking and key-paragraph constraints for relevant references.
-  * Evidence:
-    * frontend/IMPLEMENTATION_SUMMARY.md#L147-L158.
-* Provide keyboard shortcuts for save and approve attempts.
-  * Evidence:
-    * frontend/CODEBASE.md#L184-L184.
-* Provide toast-based feedback for network failures and undo interactions.
-  * Evidence:
-    * frontend/CODEBASE.md#L215-L233.
-* Provide a demo mode that disables telemetry and may use mock providers.
-  * Evidence:
-    * frontend/README.md#L74-L92.
-
-## Backend and API
-
-* Expose a health endpoint at `GET /healthz`.
-  * Evidence:
-    * backend/CODEBASE.md#L14-L15.
-    * backend/CODEBASE.md#L159-L161.
-* Accept snake_case and camelCase inputs but always emit camelCase responses.
-  * Evidence:
-    * backend/CODEBASE.md#L32-L34.
-* Enforce optimistic concurrency using Cosmos ETags on write paths.
-  * Evidence:
-    * backend/CODEBASE.md#L32-L34.
-    * backend/docs/api-change-checklist-assignments.md#L74-L90.
-* Require `If-Match` on assignment write paths and return the updated ETag in both response headers and body.
-  * Evidence:
-    * backend/docs/api-change-checklist-assignments.md#L74-L90.
-    * backend/docs/api-change-checklist-assignments.md#L168-L178.
-* Return HTTP 412 (Precondition Failed) on a missing or mismatched ETag, with stable error codes, and include the current ETag in the response.
-  * Evidence:
-    * backend/docs/api-change-checklist-assignments.md#L74-L90.
-    * backend/docs/api-change-checklist-assignments.md#L168-L178.
-* Represent delete via soft delete semantics (`status=deleted`).
-  * Evidence:
-    * backend/CODEBASE.md#L33-L34.
-* Consolidate ground-truth item writes into the SME PUT and Curator PUT endpoints, with reference changes folded into these updates.
-  * Evidence:
-    * backend/docs/api-write-consolidation-plan.v2.md#L28-L36.
-    * backend/docs/api-write-consolidation-plan.v2.md#L64-L66.
-* Keep curator import as a create-only flow.
-  * Evidence:
-    * backend/docs/api-write-consolidation-plan.v2.md#L38-L45.
-    * backend/docs/api-write-consolidation-plan.v2.md#L62-L64.
-* Enforce assignment ownership on SME mutation routes and return a stable ownership error.
-  * Evidence:
-    * backend/docs/api-change-checklist-assignments.md#L82-L86.
-    * backend/docs/api-change-checklist-assignments.md#L168-L176.
-* Clear assignment fields atomically when transitioning to skipped, approved, or deleted.
-  * Evidence:
-    * backend/docs/api-change-checklist-assignments.md#L10-L12.
-    * backend/docs/api-change-checklist-assignments.md#L175-L178.
-* Use timezone-aware UTC timestamps for assignment time fields.
-  * Evidence:
-    * backend/docs/api-change-checklist-assignments.md#L13-L14.
-    * backend/docs/api-change-checklist-assignments.md#L182-L184.
-* Include `etag` in JSON bodies for assignment responses.
-  * Evidence:
-    * backend/docs/api-change-checklist-assignments.md#L19-L24.
-    * backend/docs/api-change-checklist-assignments.md#L35-L38.
-* Provide a single-item assign endpoint that rejects items already draft-assigned to another user and sets status to draft upon successful assignment.
-  * Evidence:
-    * backend/docs/assign-single-item-endpoint.md#L22-L29.
-    * backend/docs/assign-single-item-endpoint.md#L33-L43.
-* Create or upsert a secondary assignment document (materialized view) to enable fast per-user assigned-item queries.
-  * Evidence:
-    * backend/docs/assign-single-item-endpoint.md#L85-L95.
-
-## Data and storage
-
-* Support Cosmos DB as a persistence backend with a storage-layer abstraction.
-  * Evidence:
-    * backend/CODEBASE.md#L24-L30.
-    * backend/app/adapters/repos/base.py#L1-L55.
-* Support a Cosmos emulator mode for local development, without blocking app startup if the emulator is not ready.
-  * Evidence:
-    * backend/app/main.py#L56-L85.
-    * backend/CODEBASE.md#L11-L14.
-* Handle Cosmos emulator query limitations, including the lack of `ARRAY_CONTAINS` support, by adjusting behavior and skipping tests where appropriate.
-  * Evidence:
-    * backend/docs/cosmos-emulator-limitations.md#L5-L27.
-* Support a safe workaround for Cosmos emulator Unicode/backslash parsing bugs when configured.
-  * Evidence:
-    * backend/README.md#L104-L121.
-    * backend/docs/cosmos-emulator-unicode-workaround.md#L35-L39.
-* Preserve backward compatibility for stored ground-truth fields while extending the multi-turn model (optional refs and tags in history).
-  * Evidence:
-    * backend/docs/multi-turn-refs.md#L67-L76.
-
-## Export
-
-* Support snapshot export with `attachment` and `artifact` delivery modes, with stable defaults when the request is empty.
-  * Evidence:
-    * backend/docs/export-pipeline.md#L26-L34.
-* Return JSON document payloads for snapshot download endpoints.
-  * Evidence:
-    * backend/docs/export-pipeline.md#L33-L38.
-* For artifact delivery, write a deterministic set of per-item files plus a manifest that includes a stable `schemaVersion`.
-  * Evidence:
-    * backend/docs/export-pipeline.md#L76-L90.
-* Run export processors before formatting and support merging tag fields into a single exported `tags` array.
-  * Evidence:
-    * backend/docs/export-pipeline.md#L116-L128.
-
-## Observability and operations
-
-* Include a per-request user identifier in logs.
-  * Evidence:
-    * backend/README.md#L334-L341.
-* Provide opt-in telemetry that is disabled by default and safely no-ops when disabled or in demo mode.
-  * Evidence:
-    * frontend/docs/OBSERVABILITY_IMPLEMENTATION.md#L13-L18.
-    * frontend/README.md#L74-L92.
-* Provide a UI error boundary that catches render failures and optionally reports exceptions when telemetry is enabled.
-  * Evidence:
-    * frontend/docs/OBSERVABILITY_IMPLEMENTATION.md#L79-L86.
-
-## Security and privacy
-
-* In dev mode, support user simulation via `X-User-Id` to drive per-user behaviors.
-  * Evidence:
-    * backend/README.md#L334-L341.
-    * frontend/README.md#L20-L44.
-* Enforce ownership for assignment mutation endpoints to prevent unauthorized changes.
-  * Evidence:
-    * backend/docs/api-change-checklist-assignments.md#L82-L86.
-    * backend/docs/api-change-checklist-assignments.md#L168-L176.
-* Keep telemetry safe-by-default and opt-in.
-  * Evidence:
-    * frontend/docs/OBSERVABILITY_IMPLEMENTATION.md#L13-L18.
-
-## Quality and testing
-
-* Maintain deterministic tag normalization behavior to support stable comparisons, exports, and tests.
-  * Evidence:
-    * backend/docs/tagging_plan.md#L54-L60.
-* Skip or adjust tests in emulator mode when the emulator does not support required query capabilities.
-  * Evidence:
-    * backend/docs/cosmos-emulator-limitations.md#L5-L27.
-    * backend/README.md#L248-L259.
-
-## Cross-cutting constraints and notes
-
-These items are implementation-adjacent but reflect constraints or invariants documented in the sources.
-
-* Prefer a layered architecture where API routes remain thin and workflow/state validation occurs in services rather than repository implementations.
-  * Evidence:
-    * backend/CODEBASE.md#L24-L30.
-    * backend/docs/assign-single-item-endpoint.md#L78-L87.
-* Treat emulator compatibility as a first-class constraint for local development.
-  * Evidence:
-    * backend/docs/cosmos-emulator-limitations.md#L5-L27.
-    * backend/README.md#L104-L121.
-
-## Conflicts and ambiguities to resolve
-
-* Reference search and LLM endpoints are described inconsistently across the frontend docs.
-  * Evidence:
-    * frontend/docs/MVP_REQUIREMENTS.md#L28-L36.
-    * frontend/CODEBASE.md#L136-L145.
-* Tag semantics differ between “canonical group:value tags” and permissive per-history tags, and it is unclear which UI validations apply to which fields.
-  * Evidence:
-    * backend/docs/tagging_plan.md#L3-L10.
-    * backend/docs/history-tags-feature.md#L140-L150.
-* Tag registry write support is unclear from the frontend requirements: the document mentions “create new tags” while also stating that there are no write endpoints for tags.
-  * Evidence:
-    * frontend/docs/MVP_REQUIREMENTS.md#L22-L27.
-* Cosmos emulator Unicode workaround coverage may have drifted for non-ground-truth containers.
-  * Evidence:
-    * backend/COSMOS_EMULATOR_UNICODE_WORKAROUND.md#L111-L128.
-    * backend/app/adapters/repos/tags_repo.py#L93-L124.
-
diff --git a/.copilot-tracking/subagent/20260121/conventions-and-sources-research.md b/.copilot-tracking/subagent/20260121/conventions-and-sources-research.md
deleted file mode 100644
index 4b94a45..0000000
--- a/.copilot-tracking/subagent/20260121/conventions-and-sources-research.md
+++ /dev/null
@@ -1,170 +0,0 @@
----
-title: Conventions and sources research
-description: Repo guidance on documentation + identification of sources of truth for requirements vs implementation
-author: GitHub Copilot
-ms.date: 2026-01-21
-ms.topic: reference
-keywords:
-  - conventions
-  - requirements
-  - documentation
-  - source of truth
-estimated_reading_time: 6
----
-
-# Conventions and Sources Research — Requirements vs Implementation
-
-## Scope
-
-This note answers:
-
-- Where this repo documents **product requirements** (sources of truth)
-- Where this repo documents **implementation plans/behavior** (derived from requirements)
-- What **doc-writing conventions** exist (including markdown style constraints, if any)
-
-This is “research only” and does not propose changes.
-
-## Primary instruction sources (repo)
-
-### Copilot instruction files
-
-- Backend Copilot conventions: [backend/.github/copilot-instructions.md](../../../backend/.github/copilot-instructions.md#L1-L4)
-  - Timestamp rule: use `datetime.now(timezone.utc)` for timestamp updates.
-  - Typing rule: prefer built-in generics (`dict`, `list`) over `typing.Dict`/`typing.List`.
-  - Workflow hint: use notify MCP `show-notification` when done.
-- Frontend Copilot conventions: [frontend/.github/copilot-instructions.md](../../../frontend/.github/copilot-instructions.md#L1)
-  - Workflow hint: use notify MCP `show-notification` when done.
-
-Note: There are duplicated copies under `workspace-1/` and `workspace-2/` mirroring the same instruction patterns.
-
-### Repo prompt templates (doc/plan authoring)
-
-- “Discussion prep” prompt: [backend/.github/prompts/build_context.prompt.md](../../../backend/.github/prompts/build_context.prompt.md#L1-L6)
-  - Explicitly instructs creating a markdown file in `/docs` for discussion preparation.
-- “Planning” prompt: [backend/.github/prompts/plan.prompt.md](../../../backend/.github/prompts/plan.prompt.md#L1-L16)
-  - Explicitly instructs writing plans to `/plans/*-plan.md`.
-- Frontend planning prompt: [frontend/.github/prompts/plan.prompt.md](../../../frontend/.github/prompts/plan.prompt.md#L1-L14)
-  - Similar planning guidance, but **does not** prescribe a plan output folder.
-
-## Sources of truth (product requirements)
-
-### 1) Canonical requirements doc
-
-- Requirements: [docs/ground-truth-curation-reqs.md](../../../docs/ground-truth-curation-reqs.md)
-  - Explicitly labeled “MVP Requirements”.
-  - Used as the declared requirements source of truth for backend implementation planning: [backend/docs/fastapi-implementation-plan.md](../../../backend/docs/fastapi-implementation-plan.md#L1-L7)
-
-Interpretation:
-
-- Treat `docs/ground-truth-curation-reqs.md` as the top-level contract for “what the system must do” (personas, scope, flows, and open questions).
-
-### 2) Business value framing
-
-- Product value narrative: [BUSINESS_VALUE.md](../../../BUSINESS_VALUE.md)
-  - Declares ground truth as “source of truth for model and agent evaluation”.
-
-Interpretation:
-
-- This doc is not a detailed functional spec, but it is a “why we’re building this” source and can anchor prioritization.
-
-### 3) Backlog / work-item source inputs
-
-- Jira-derived backlog lists:
-  - [prd.json](../../../prd.json)
-  - [prd-refined-1.json](../../../prd-refined-1.json)
-  - [prd-refined-2.json](../../../prd-refined-2.json)
-  - [prd-genericize.json](../../../prd-genericize.json)
-  - [Jira.csv](../../../Jira.csv)
-
-Interpretation:
-
-- These files look like work-item exports (issue IDs, titles, descriptions, status). They are useful for scope tracking and prioritization, but they are not written as a normative requirements spec.
-
-## Secondary “spec/design” docs (normative by area)
-
-These docs behave like “design specs” for specific subsystems and often use explicit language like “authoritative”, “canonical”, or “source of truth”. They appear intended to guide implementation behavior.
-
-### Tagging: manual vs computed
-
-- Manual tags design: [docs/manual-tags-design.md](../../../docs/manual-tags-design.md)
-  - Manual tags “remain authoritative” and are “source of truth” in `manualTags`: [docs/manual-tags-design.md](../../../docs/manual-tags-design.md#L16-L22)
-  - A merged `tags` view may exist, but is “not authoritative”: [docs/manual-tags-design.md](../../../docs/manual-tags-design.md#L43-L44)
-- Computed tags design: [docs/computed-tags-design.md](../../../docs/computed-tags-design.md)
-  - Includes explicit “authoritative manual tags” examples.
-
-Related requirement gap:
-
-- The MVP requirements doc explicitly lists “Authoritative source of truth for tags” as an open question: [docs/ground-truth-curation-reqs.md](../../../docs/ground-truth-curation-reqs.md#L382)
-
-Interpretation:
-
-- Tag “source of truth” is partly specified (manualTags authoritative) but still called out as an open requirements question at the MVP level.
-
-### Frontend runtime configuration
-
-- Runtime config precedence: [docs/frontend-runtime-configuration.md](../../../docs/frontend-runtime-configuration.md)
-  - Declares backend env vars as “authoritative” and frontend `.env` as fallback-only: [docs/frontend-runtime-configuration.md](../../../docs/frontend-runtime-configuration.md#L23-L33)
-
-### Export schema and migration
-
-- Canonical export schema and migration: [docs/json-export-migration-plan.md](../../../docs/json-export-migration-plan.md)
-  - Uses “canonical schema” language for the JSON wire format.
-
-## Implementation sources (how the repo should be built/extended)
-
-### Backend implementation guides
-
-- Backend “authoritative” implementation guide: [backend/CODEBASE.md](../../../backend/CODEBASE.md)
-  - Explicitly says to add clarifications there so it “stays authoritative”: [backend/CODEBASE.md](../../../backend/CODEBASE.md#L222)
-- Backend staged implementation plan: [backend/docs/fastapi-implementation-plan.md](../../../backend/docs/fastapi-implementation-plan.md)
-  - Explicitly derived from the canonical requirements doc.
-- Backend feature/workflow specs (implementation-facing): [backend/docs/](../../../backend/docs/)
-  - Examples: API consolidation plans, export pipeline, tagging plan, emulator limitations/workarounds.
-
-Interpretation:
-
-- `backend/CODEBASE.md` is the “how to work in this codebase” source.
-- `backend/docs/*` appears to be the system’s implementation-oriented spec set.
-
-### Frontend implementation guides
-
-- Frontend codebase guide: [frontend/CODEBASE.md](../../../frontend/CODEBASE.md)
-  - Documents architecture, conventions, and safe extension points.
-- Frontend MVP checklist: [frontend/docs/MVP_REQUIREMENTS.md](../../../frontend/docs/MVP_REQUIREMENTS.md#L1)
-  - Appears to be a status-tracking checklist (items marked `[x]/[ ]`), mixing frontend needs with backend status notes.
-
-Interpretation:
-
-- `frontend/CODEBASE.md` is the best “implementation guide” for frontend structure.
-- `frontend/docs/MVP_REQUIREMENTS.md` is useful operationally, but it reads more like a progress checklist than a normative product requirements doc.
-
-## Markdown / doc-writing style constraints (repo-observable)
-
-### 1) Frontmatter convention is common
-
-Many markdown documents include Microsoft Docs-style YAML frontmatter:
-
-- Example: `ms.date` / `ms.topic` in [docs/manual-tags-design.md](../../../docs/manual-tags-design.md#L1-L12)
-- Example: `ms.date` / `ms.topic` in [frontend/CODEBASE.md](../../../frontend/CODEBASE.md#L1-L12)
-
-Interpretation:
-
-- For “real” documentation/spec files (especially in `docs/` and major `CODEBASE.md` guides), using YAML frontmatter appears to be the convention.
-
-### 2) Markdownlint appears in some artifacts, but no repo config was found
-
-- Multiple `.copilot-tracking/*` documents start with `<!-- markdownlint-disable-file -->` (evidence via grep), suggesting markdownlint is used somewhere in the authoring workflow.
-- No `.markdownlint*` config file was found in this repo (search across common config names returned none).
-
-Interpretation:
-
-- There is no repo-visible markdownlint ruleset to follow, but some generated/tracking artifacts proactively disable markdownlint.
-
-### 3) Formatting/tooling constraints are primarily code-focused
-
-- Frontend uses Biome for lint/format via `biome check --write`: [frontend/package.json](../../../frontend/package.json#L7-L18) and config in [frontend/biome.json](../../../frontend/biome.json)
-  - This is primarily relevant to code (TS/JS/JSON). No repo evidence that markdown is formatted/linted by Biome here.
-
-## Notes on repo layout duplicates
-
-This repo contains `workspace-1/` and `workspace-2/` directories with mirrored docs and `.github` conventions. For “source of truth” purposes, the top-level `docs/`, `backend/`, and `frontend/` folders appear to be the canonical set; the workspace copies look like snapshots or sandboxes.
diff --git a/.copilot-tracking/subagent/20260121/conventions-research.md b/.copilot-tracking/subagent/20260121/conventions-research.md
deleted file mode 100644
index a8302be..0000000
--- a/.copilot-tracking/subagent/20260121/conventions-research.md
+++ /dev/null
@@ -1,185 +0,0 @@
-# Conventions Research — Backend Refactor (Repo/Service/API layering + Cosmos emulator)
-
-## Scope
-
-This note summarizes repository conventions and layering rules relevant to:
-
-- Moving workflow logic out of API/routes and repo implementations into services
-- Handling Cosmos DB emulator differences (including a potential “emulator subclass” strategy)
-
-## Primary Sources
-
-- Architecture overview: [backend/CODEBASE.md](../../../../backend/CODEBASE.md#L8-L36)
-- DI/composition container: [backend/app/container.py](../../../../backend/app/container.py#L1-L322)
-- App startup wiring: [backend/app/main.py](../../../../backend/app/main.py#L1-L85)
-- Repo protocol: [backend/app/adapters/repos/base.py](../../../../backend/app/adapters/repos/base.py#L1-L120)
-- Cosmos repo implementation: [backend/app/adapters/repos/cosmos_repo.py](../../../../backend/app/adapters/repos/cosmos_repo.py#L630-L1890)
-- Emulator docs:
-  - Conditional patch pattern: [backend/CONDITIONAL_PATCH_IMPLEMENTATION.md](../../../../backend/CONDITIONAL_PATCH_IMPLEMENTATION.md#L1-L88)
-  - Emulator limitations: [backend/docs/cosmos-emulator-limitations.md](../../../../backend/docs/cosmos-emulator-limitations.md#L1-L90)
-  - Unicode/backslash workarounds: [backend/docs/cosmos-emulator-unicode-workaround.md](../../../../backend/docs/cosmos-emulator-unicode-workaround.md#L1-L219), [backend/COSMOS_EMULATOR_UNICODE_WORKAROUND.md](../../../../backend/COSMOS_EMULATOR_UNICODE_WORKAROUND.md#L1-L230)
-
-## Conventions & Layering Rules
-
-### 1) Explicit layered architecture (API → Services → Repos/Adapters)
-
-The backend explicitly documents a layered architecture with composition in a central container:
-
-- API layer: routers in [backend/CODEBASE.md](../../../../backend/CODEBASE.md#L24-L30)
-- Services layer: workflow logic in [backend/CODEBASE.md](../../../../backend/CODEBASE.md#L24-L30)
-- Repositories/adapters layer: Cosmos repo implements a protocol in [backend/CODEBASE.md](../../../../backend/CODEBASE.md#L24-L30)
-- Composition via singleton container: [backend/CODEBASE.md](../../../../backend/CODEBASE.md#L24-L30)
-
-Practical implication for refactor:
-
-- Route handlers should remain “thin” (HTTP parsing/validation + calling services).
-- Services should own workflow/state validation and call repos.
-- Repos should be storage-focused (querying/persistence, ETag enforcement), not business policy.
-
-### 2) Service layer owns state validation; repo methods can be intentionally state-agnostic
-
-The “assign single item” backend design doc explicitly states that `assign_to()` is state-agnostic and that state validation belongs in the service layer:
-
-- “State validation is the responsibility of the service layer” in [backend/docs/assign-single-item-endpoint.md](../../../../backend/docs/assign-single-item-endpoint.md#L78-L87)
-
-This is a strong precedent for moving validations/decision logic out of repo implementations and into services.
-
-### 3) DI/composition pattern: singleton `container` wires repos and services
-
-The DI approach is a simple global singleton container object used by routers and services:
-
-- Container class and global instance: [backend/app/container.py](../../../../backend/app/container.py#L34-L71), [backend/app/container.py](../../../../backend/app/container.py#L321-L322)
-- Container lazily initializes repo/services; tests and lifespan call `init_cosmos_repo()` to bind to the current event loop: [backend/app/container.py](../../../../backend/app/container.py#L50-L56)
-- Cosmos startup is centralized in `startup_cosmos()`, which explicitly does the following
-  (see [backend/app/container.py](../../../../backend/app/container.py#L190-L223)):
-  - creates repo instances
-  - initializes async clients
-  - validates containers
-
-Practical implication for refactor:
-
-- New services should be registered on `Container` (as attributes) and wired in `init_cosmos_repo()` (or in `__init__` if repo-independent).
-- Route handlers should call `container.<service>` rather than `container.repo` when a workflow exists.
-
-### 4) Current state: routers sometimes call repos directly (mixed style)
-
-There is evidence of both patterns:
-
-- Direct repo calls from API routes: e.g. [backend/app/api/v1/ground_truths.py](../../../../backend/app/api/v1/ground_truths.py#L241-L246) and [backend/app/api/v1/ground_truths.py](../../../../backend/app/api/v1/ground_truths.py#L277-L293)
-- But also service usage from routes: snapshot endpoints call `container.snapshot_service`: [backend/app/api/v1/ground_truths.py](../../../../backend/app/api/v1/ground_truths.py#L135-L151)
-
-Interpretation:
-
-- The codebase currently mixes both styles: some routes call repos directly, while others go through services.
-- The documented architecture (and newer design docs) push toward service-owned workflows.
-
-## Emulator Handling Conventions
-
-### 1) Emulator is expected to be flaky/unready at startup; startup should be fail-soft
-
-Startup intentionally does not block if Cosmos init fails (emulator might not be ready):
-
-- “Don’t block startup; emulator may not be ready yet” in [backend/app/main.py](../../../../backend/app/main.py#L56-L85)
-- Same idea documented in [backend/CODEBASE.md](../../../../backend/CODEBASE.md#L11-L14)
-
-Practical implication:
-
-- Emulator-specific subclasses/branches should preserve fail-soft behavior (don’t crash the app on emulator-only issues where possible).
-
-### 2) Conditional behavior for emulator compatibility is a standard pattern here
-
-The repo already uses “if emulator then alternate implementation” in multiple places:
-
-- Emulator detection via endpoint string: [backend/app/adapters/repos/cosmos_repo.py](../../../../backend/app/adapters/repos/cosmos_repo.py#L639-L641)
-
-**Conditional patching example (`assign_to`)**
-
-- Documented split into a main method, a production patch path, and an emulator read-modify-replace path: [backend/CONDITIONAL_PATCH_IMPLEMENTATION.md](../../../../backend/CONDITIONAL_PATCH_IMPLEMENTATION.md#L11-L22)
-- Implemented selection logic in code: [backend/app/adapters/repos/cosmos_repo.py](../../../../backend/app/adapters/repos/cosmos_repo.py#L1719-L1737)
-
-This establishes a repo convention:
-
-- Prefer a single public method that routes internally based on emulator detection.
-- Keep emulator compatibility paths available when Cosmos emulator lacks features.
-
-### 3) Emulator limitations drive in-memory fallbacks and test skips
-
-The emulator limitation on `ARRAY_CONTAINS` is explicitly documented:
-
-- Emulator does not support `ARRAY_CONTAINS`; tag filtering queries fail; tests are skipped: [backend/docs/cosmos-emulator-limitations.md](../../../../backend/docs/cosmos-emulator-limitations.md#L5-L27)
-- Workaround: in-memory tag filtering fallback described in [backend/docs/cosmos-emulator-limitations.md](../../../../backend/docs/cosmos-emulator-limitations.md#L29-L36)
-
-The Cosmos repo also uses an emulator-specific fallback for pagination when filtering by tags or `ref_url`:
-
-- “For queries with tags… filter in-memory… use in-memory filtering for ref_url if Cosmos emulator is used…” in [backend/app/adapters/repos/cosmos_repo.py](../../../../backend/app/adapters/repos/cosmos_repo.py#L694-L709)
-
-### 4) Emulator Unicode/backslash issues are handled via flag-driven transforms
-
-There are two related docs here:
-
-1) “Unicode character” normalization doc (smart quotes, dashes, etc.)
-
-- Workaround is activated by `GTC_COSMOS_DISABLE_UNICODE_ESCAPE=true` and should not be enabled in production: [backend/COSMOS_EMULATOR_UNICODE_WORKAROUND.md](../../../../backend/COSMOS_EMULATOR_UNICODE_WORKAROUND.md#L27-L33), [backend/COSMOS_EMULATOR_UNICODE_WORKAROUND.md](../../../../backend/COSMOS_EMULATOR_UNICODE_WORKAROUND.md#L148-L161)
-
-2) “Unicode escape sequence / backslash” bug doc (Base64 encode `refs.content`)
-
-- Final solution is Base64 encoding `refs[*].content` when `GTC_COSMOS_DISABLE_UNICODE_ESCAPE=true`: [backend/docs/cosmos-emulator-unicode-workaround.md](../../../../backend/docs/cosmos-emulator-unicode-workaround.md#L35-L39)
-- Encoding/decoding helpers and `_contentEncoded` marker: [backend/docs/cosmos-emulator-unicode-workaround.md](../../../../backend/docs/cosmos-emulator-unicode-workaround.md#L41-L88)
-- Explicit scope is only `refs[*].content` and only when the flag is true: [backend/docs/cosmos-emulator-unicode-workaround.md](../../../../backend/docs/cosmos-emulator-unicode-workaround.md#L105-L120)
-
-Practical implication:
-
-- Emulator-specific behavior is controlled through settings flags and is intentionally scoped to the minimum necessary.
-- Any emulator subclass approach should respect and reuse these flags rather than introducing a second, parallel flag.
-
-## Settings / Flag Conventions Relevant to Emulator and Backend Selection
-
-- Settings use the `GTC_` env prefix (the setting names below omit it) and load env defaults from `environments/sample.env`: [backend/app/core/config.py](../../../../backend/app/core/config.py#L11-L21)
-- Backend selection is via `REPO_BACKEND` (memory|cosmos): [backend/app/core/config.py](../../../../backend/app/core/config.py#L31-L34)
-- Emulator-related flags:
-  - `USE_COSMOS_EMULATOR`: [backend/app/core/config.py](../../../../backend/app/core/config.py#L41-L46)
-  - `COSMOS_CONNECTION_VERIFY` (self-signed cert): [backend/app/core/config.py](../../../../backend/app/core/config.py#L44-L49)
-  - `COSMOS_DISABLE_UNICODE_ESCAPE`: [backend/app/core/config.py](../../../../backend/app/core/config.py#L47-L52)
-  - `COSMOS_TEST_MODE` (skips Cosmos initialization in the app lifespan): [backend/app/core/config.py](../../../../backend/app/core/config.py#L49-L53), [backend/app/main.py](../../../../backend/app/main.py#L58-L69)
-
-## Style / Misc. Engineering Conventions
-
-- Timestamp updates should use UTC: [backend/.github/copilot-instructions.md](../../../../backend/.github/copilot-instructions.md#L1)
-- Prefer built-in generics (`dict`, `list`) over `typing.Dict`/`typing.List`: [backend/.github/copilot-instructions.md](../../../../backend/.github/copilot-instructions.md#L2)
-
-## Guidance for the Planned Refactor
-
-### Moving logic from repos/API into services
-
-Repository conventions support:
-
-- Keeping repo operations storage-centric and state-agnostic when appropriate, with state validation in services: [backend/docs/assign-single-item-endpoint.md](../../../../backend/docs/assign-single-item-endpoint.md#L78-L87)
-- Using the singleton container to expose services (as already done for snapshots): [backend/app/api/v1/ground_truths.py](../../../../backend/app/api/v1/ground_truths.py#L135-L151)
-
-Suggested “shape” aligned with conventions:
-
-- Add/extend a service in `backend/app/services/*_service.py`
-- Wire it on the container in `init_cosmos_repo()` so it gets the active repo
-- Update routers to call the service
-
-### Introducing a Cosmos emulator subclass (interpretation)
-
-No doc explicitly mandates “subclassing,” but the repo has a clear convention of environment-conditional paths:
-
-- Internal switching inside the Cosmos repo based on `is_cosmos_emulator_in_use()`: [backend/app/adapters/repos/cosmos_repo.py](../../../../backend/app/adapters/repos/cosmos_repo.py#L639-L641)
-- `assign_to()` explicitly uses two implementations (patch vs read-modify-replace) selected at runtime: [backend/app/adapters/repos/cosmos_repo.py](../../../../backend/app/adapters/repos/cosmos_repo.py#L1719-L1737)
-
-If you introduce a subclass, it should fit the existing composition pattern:
-
-- The selection should happen in container wiring (e.g., `init_cosmos_repo()`), not in routers.
-- The emulator-specific class should still implement the same `GroundTruthRepo` protocol.
-- It should retain the fail-soft startup posture (emulator might not be ready).
-
-A minimal-risk alternative consistent with existing code:
-
-- Keep a single `CosmosGroundTruthRepo` and add conditional internal branches for emulator-only incompatibilities (the existing pattern).
-
-## Notes / Gaps
-
-- There is no explicit “ports/adapters hexagonal architecture” guidance beyond the documented folder layout and the `GroundTruthRepo` protocol.
-- Observability docs are extensive but not directly prescriptive for repo/service refactors, except indirectly (fail-soft + structured logging patterns).
diff --git a/.copilot-tracking/subagent/20260121/cosmos-repo-research.md b/.copilot-tracking/subagent/20260121/cosmos-repo-research.md
deleted file mode 100644
index c02fe3a..0000000
--- a/.copilot-tracking/subagent/20260121/cosmos-repo-research.md
+++ /dev/null
@@ -1,311 +0,0 @@
-# Cosmos repo + emulator mixing research (2026-01-21)
-
-## Scope
-
-Primary file: [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py)
-
-Related emulator/config wiring:
-- [backend/app/container.py](backend/app/container.py)
-- [backend/app/core/config.py](backend/app/core/config.py)
-- [backend/docs/cosmos-emulator-limitations.md](backend/docs/cosmos-emulator-limitations.md)
-- [backend/COSMOS_EMULATOR_UNICODE_WORKAROUND.md](backend/COSMOS_EMULATOR_UNICODE_WORKAROUND.md)
-- [backend/app/adapters/repos/tags_repo.py](backend/app/adapters/repos/tags_repo.py)
-- [backend/app/services/assignment_service.py](backend/app/services/assignment_service.py)
-
-Goal: produce an inventory of code blocks inside cosmos_repo.py, classify into A/B/C, and propose concrete override seams for a new emulator-specific repo module (`cosmos_emulator.py`) that subclasses (or wraps) the production repo.
-
----
-
-## High-level finding
-
-`CosmosGroundTruthRepo` currently mixes:
-
-- Production Cosmos persistence concerns (SDK client creation, query construction, container calls, concurrency via ETags)
-- Assignment/business workflow logic (sampling allocation, quota math, selection + de-biasing, user id validation)
-- Emulator compatibility hacks (unicode sanitization, backslash sentinel, base64 refs encoding, EXISTS/ARRAY_CONTAINS workarounds, intermittent delete/upsert retries, conditional assignment via read-modify-replace)
-
-This makes it hard to reason about “production correctness” separately from “emulator survivability”, and it forces emulator constraints (like no `EXISTS` in SQL) into the default repo surface.
-
----
-
-## Inventory by category (line-cited)
-
-### A) Pure persistence concerns
-
-These blocks are “Cosmos adapter” responsibilities (query construction, paging/container calls, error mapping), and should remain in the repo layer.
-
-1) Cosmos client/connection policy setup and async loop binding
-- Connection policy + retry options built from settings: [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L260-L300)
-- Async client initialization and container client acquisition: [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L302-L356)
-
-2) Container existence validation with actionable error messages
-- DB/container validation flow: [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L358-L428)
-
-3) Document serialization/deserialization and schema compatibility
-- `_to_doc()` converts model to JSON-safe dict, sets UUID bucket string, ensures updatedAt, persists computed `totalReferences`: [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L386-L454)
-- `_from_doc()` normalizes fetched doc, handles legacy `history=None`, validates to model: [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L456-L474)
-
-4) Query construction primitives and safe sort clause construction
-- Filter builder (status/dataset/item_id/tags/ref_url): [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L526-L605)
-- Sort resolution + stable in-memory sort key: [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L607-L671)
-- ORDER BY clause constructed via fixed mapping (no raw user input): [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L679-L724)
-
-5) Paginated read path (production Cosmos)
-- Direct query path with ORDER BY + OFFSET/LIMIT, then a second query for total count: [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L726-L814)
-
-6) Counting logic
-- Tag-aware count (SQL count for prod, in-memory tag check for emulator): [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L913-L1043)
-- Non-tag count uses `SELECT VALUE COUNT(1)` to avoid the “NonValueAggregate” plan issue: [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L1045-L1107)
-
-7) Basic CRUD paths
-- List-by-dataset query (includes docType exclusion for curation docs): [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L1109-L1135)
-- `get_gt()` read-item by hierarchical partition key: [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L1137-L1154)
-- Curation instruction upsert with conditional replace by ETag: [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L1188-L1275)
-- Assignment document CRUD in secondary container: [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L1841-L1989)
-
-
-### B) Business/service logic (should be moved out)
-
-These blocks encode *workflow rules* and *domain-level decisions* rather than storage mechanics. They can be preserved, but should move to service layer(s).
-
-1) Total reference semantics are domain/business logic
-- `totalReferences` is derived from either history refs or item refs, and the repo mutates the model during persistence: [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L367-L385) and [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L386-L413)
-
-Why this is service logic:
-- It encodes a product/business definition (“history refs take priority”) and impacts UI/filters.
-- The adapter shouldn’t be responsible for deciding business meaning; it should persist what it’s given.
-
-Suggested owner:
-- A new `GroundTruthDerivationsService` (or fold into existing `CurationService` / “ground truth service” if present).
-
-Suggested signatures:
-- `class GroundTruthDerivationsService:`
-  - `def compute_total_references(self, item: GroundTruthItem) -> int`
-  - `def apply_derived_fields(self, item: GroundTruthItem) -> GroundTruthItem` (sets `totalReferences`, possibly `questionLength`, etc.)
-
-2) Sampling allocation, quotas, and selection are assignment workflow
-- The repo contains a full sampling/selection algorithm including:
-  - fetching already-assigned items first
-  - reading sampling allocation config
-  - quota computation via largest remainder
-  - per-dataset candidate queries
-  - round-robin interleave + final global fill
-  - shuffling to debias query ordering
-  [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L1388-L1600)
-- Quota computation helper is pure allocation math: [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L1681-L1716)
-
-Why this is service logic:
-- These are product-level rules about how to distribute assignment opportunities.
-- It is hard to test in isolation when buried in the persistence adapter.
-
-Suggested owner:
-- `AssignmentService` already exists and is the natural owner. It currently orchestrates `self_assign()` and retries by excluding seen IDs: [backend/app/services/assignment_service.py](backend/app/services/assignment_service.py#L44-L152)
-
-Suggested refactor:
-- Move sampling algorithm out of repo into `AssignmentService` (or a new `AssignmentSamplingService` used by `AssignmentService`).
-
-Suggested signatures:
-- `class AssignmentSamplingService:`
-  - `async def sample_candidates(self, *, user_id: str, limit: int, exclude_ids: list[str] | None = None) -> list[GroundTruthItem]`
-  - `def compute_quotas(self, weights: dict[str, float], k: int) -> dict[str, int]`
-
-Repository then exposes *only* persistence queries:
-- `async def list_unassigned_candidates_global(self, *, user_id: str, limit: int, exclude_ids: list[str] | None) -> list[GroundTruthItem]`
-- `async def list_unassigned_candidates_by_dataset_prefix(self, *, dataset_prefix: str, user_id: str, limit: int, exclude_ids: list[str] | None) -> list[GroundTruthItem]`
-
-3) Input validation of `user_id` belongs in API/service
-- Repo rejects user IDs not matching a regex: [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L1718-L1743)
-
-Why this is service logic:
-- Validation semantics (“allowed chars”) are part of API contract; the repo should not have to know.
-
-Suggested owner:
-- `AssignmentService` (or API layer) should validate `user_id` before calling repository.
-
-Suggested signature:
-- `def validate_user_id(self, user_id: str) -> None` (raises a typed error on invalid input), or a variant that returns `bool`.
-
-
-### C) Emulator / compatibility hacks
-
-These blocks exist specifically because the emulator’s behavior differs from production Cosmos DB.
-
-1) Unicode/control-char sanitization, invalid backslash escaping, and restoration
-- Smart punctuation replacements + escape/backslash handling helpers: [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L29-L118)
-- Recursive normalization (emulator-only) and restore (sentinel back to backslash): [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L121-L219)
-- The “intent wrapper” `_ensure_utf8_strings()` used by write paths: [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L430-L454)
-
-Note: The repo also adds a *second* workaround by base64-encoding `refs[*].content` to avoid emulator rejection of “certain character sequences”: [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L53-L104) and [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L148-L176)
-
-2) SQL feature gaps: emulator incompatibilities drive in-memory filtering
-- `list_gt_paginated()` routes to the emulator path when `tags` or `ref_url` are present and the endpoint is localhost: [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L748-L770)
-- Emulator pagination path disables SQL tag/ref_url filters (no ARRAY_CONTAINS strategy / no EXISTS) then filters in memory: [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L816-L912)
-
-This is consistent with the emulator limitations doc:
-- [backend/docs/cosmos-emulator-limitations.md](backend/docs/cosmos-emulator-limitations.md#L1-L39)
-
-3) Conditional assignment: patch in production, read-modify-replace in emulator
-- Environment detection: [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L671-L677)
-- `assign_to()` routes to emulator vs production implementation: [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L1718-L1752)
-- Production implementation uses `patch_item` with a non-parameterized `filter_predicate` (string interpolation): [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L1754-L1838)
-- Emulator implementation uses read-modify-replace: [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L1840-L1980)
-
-Related design note:
-- [backend/CONDITIONAL_PATCH_IMPLEMENTATION.md](backend/CONDITIONAL_PATCH_IMPLEMENTATION.md#L1-L52)
-
-4) Retry logic for emulator intermittent errors + payload sanitization retry
-- `upsert_gt()` includes special retry paths for:
-  - `etag_mismatch` mapping
-  - intermittent emulator “jsonb type as object key” errors
-  - emulator invalid JSON payload errors triggering a sanitize-and-retry
-  [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L1277-L1402)
-- `delete_dataset()` includes emulator-only retry for jsonb/HTTP-format errors, plus retry on deleting curation doc: [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L1422-L1525)
-
----
-
-## Related emulator knobs and behaviors (outside cosmos_repo.py)
-
-1) Settings flags and Cosmos knobs
-- Emulator flags + unicode escape toggle live in Settings: [backend/app/core/config.py](backend/app/core/config.py#L28-L56)
-
-2) DI container currently always uses `CosmosGroundTruthRepo`
-- Repo wiring picks `CosmosGroundTruthRepo` and only uses endpoint scheme / `USE_COSMOS_EMULATOR` to decide AAD vs key auth, not to change repo class: [backend/app/container.py](backend/app/container.py#L86-L138)
-
-3) Tags repo exists separately and does not currently apply the unicode workaround
-- `CosmosTagsRepo.save_global_tags()` does a plain upsert without `_ensure_utf8_strings`: [backend/app/adapters/repos/tags_repo.py](backend/app/adapters/repos/tags_repo.py#L93-L124)
-
-This matters because [backend/COSMOS_EMULATOR_UNICODE_WORKAROUND.md](backend/COSMOS_EMULATOR_UNICODE_WORKAROUND.md#L111-L128) claims tags repo applies normalization; code appears to have drifted.
-
----
-
-## Proposed refactor direction
-
-### Objective
-
-Create a clean production repo with no emulator branches in hot paths, and move emulator constraints into a separate implementation in `backend/app/adapters/repos/cosmos_emulator.py`.
-
-### Recommended shape
-
-1) Production repo remains `CosmosGroundTruthRepo`
-- Keep only production-correct Cosmos SQL usage and patch-based assignment.
-- Keep generic retry policy based on Cosmos SDK RetryOptions (already configured in connection policy).
-
-2) New emulator repo: `CosmosEmulatorGroundTruthRepo`
-- Subclass `CosmosGroundTruthRepo` and override only the minimal behavior differences.
-- Keep emulator-only sanitization and retry logic local to emulator class.
-
-3) Move B-category logic into services
-- Sampling/quotas to `AssignmentService` (or `AssignmentSamplingService`)
-- Derived fields like `totalReferences` to a derivations service
-
----
-
-## Exact override seams for `cosmos_emulator.py`
-
-### Suggested class name
-
-`CosmosEmulatorGroundTruthRepo`
-
-### Suggested constructor signature
-
-Keep it 1:1 with production to minimize DI churn:
-
-- `def __init__(self, endpoint: str, key: str | None, db_name: str, gt_container_name: str, assignments_container_name: str, connection_verify: bool | str | None = None, test_mode: bool = False, credential: Any | None = None) -> None`
-
-(Optionally add `*, emulator_flags: EmulatorFlags | None = None` only if you want to decouple from global `settings`.)
-
-### Minimal subclass surface (recommended)
-
-Override these methods/properties only:
-
-1) Environment detection
-- `def is_cosmos_emulator_in_use(self) -> bool`
-  - Return `True` unconditionally in the emulator subclass to eliminate endpoint string checks.
-
-2) Document transforms
-- Add hook methods in the *production* base class (or override existing wrapper):
-  - `def _pre_write_transform(self, doc: dict[str, Any]) -> dict[str, Any]`
-  - `def _post_read_transform(self, doc: dict[str, Any]) -> dict[str, Any]`
-
-In the emulator subclass:
-- `_pre_write_transform` applies:
-  - unicode/control-char sanitization
-  - backslash sentinel substitution
-  - base64 refs content encoding
-  - (optional) `json.dumps(..., ensure_ascii=True)` roundtrip if needed for emulator
-- `_post_read_transform` applies restore + base64 decode
-
-These behaviors are currently spread across:
-- [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L53-L219)
-- Used in write paths like import/upsert/curation/assignment docs: [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L494-L513) and [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L1234-L1272) and [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L1912-L1934)
-
-3) Pagination capability differences
-- `async def list_gt_paginated(...)`
-  - Emulator subclass should route to `_list_gt_paginated_with_emulator` whenever `tags` or `ref_url` are present.
-  - Production base class keeps the direct SQL path.
-
-Currently: [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L748-L770)
-
-4) Assignment method
-- `async def assign_to(self, item_id: str, user_id: str) -> bool`
-  - Emulator subclass forces read-modify-replace flow.
-  - Production base forces patch flow.
-
-Currently: [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L1718-L1980)
-
-5) Emulator-only retry policy for deletes/upserts
-- `async def upsert_gt(...)` and `async def delete_dataset(...)`
-  - Emulator subclass retains the intermittent emulator bug retries.
-  - Production base can keep ETag handling and rely on SDK retry options, avoiding emulator-specific message matching.
-
-Currently:
-- Upsert retry + sanitize retry: [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L1277-L1402)
-- Delete dataset retry: [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L1422-L1525)
-
-If you want an even smaller surface, introduce a single overridable policy method:
-- `def _should_retry_emulator_exception(self, exc: Exception) -> bool`
-The retry loops then stay in the base class and call this method.
-
----
-
-## DI container & invariants
-
-### DI wiring change required
-
-`Container.init_cosmos_repo()` currently always constructs `CosmosGroundTruthRepo`: [backend/app/container.py](backend/app/container.py#L86-L138)
-
-To adopt subclassing cleanly, `init_cosmos_repo` should choose:
-- `CosmosEmulatorGroundTruthRepo` when `settings.USE_COSMOS_EMULATOR` is true or the endpoint is a non-TLS local emulator
-- `CosmosGroundTruthRepo` otherwise
-
-Invariants to preserve:
-- Same constructor args for both repos, so container swap is trivial.
-- `await repo._init()` must still be called on startup (lifespan/startup path relies on async client binding).
-
-### Tests likely to be impacted
-
-1) Unicode tests import the private normalization function directly
-- [backend/tests/unit/test_unicode_fix.py](backend/tests/unit/test_unicode_fix.py#L10-L12)
-
-If normalization moves to emulator module, either:
-- keep `_normalize_unicode_for_cosmos` exported from cosmos_repo.py as a compatibility shim, or
-- update tests to import from emulator module.
-
-2) Unit tests validate `_build_query_filter` tag clause uses `ARRAY_CONTAINS`
-- [backend/tests/unit/test_cosmos_repo.py](backend/tests/unit/test_cosmos_repo.py#L33-L58)
-
-If you split production vs emulator query builders, keep production semantics in `CosmosGroundTruthRepo._build_query_filter` and put emulator differences behind `list_gt_paginated` routing (recommended), so tests remain valid.
-
-3) Assignment tests may depend on selection behavior
-- `AssignmentService` already retries with `exclude_ids`; repo sampling also supports `exclude_ids` via query building. Refactor must maintain that exclusion contract.
-
----
-
-## Recommendation snapshot
-
-- Move B-category logic out of [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py):
-  - `sample_unassigned` + `_compute_quotas` → `AssignmentService` / `AssignmentSamplingService`
-  - `totalReferences` derivation → a derivations service (or domain model)
-  - `user_id` validation → API/service
-- Keep A-category logic in the repo.
-- Create `CosmosEmulatorGroundTruthRepo` in `cosmos_emulator.py` and concentrate C-category logic there, with the override seams listed above.
diff --git a/.copilot-tracking/subagent/20260121/frontend-requirements-research.md b/.copilot-tracking/subagent/20260121/frontend-requirements-research.md
deleted file mode 100644
index fac6dff..0000000
--- a/.copilot-tracking/subagent/20260121/frontend-requirements-research.md
+++ /dev/null
@@ -1,263 +0,0 @@
-# Frontend requirements research (from frontend docs)
-
-Date: 2026-01-21
-Scope: Research-only inference of **high-level** frontend UX requirements that match the existing system.
-
-## Sources reviewed
-
-- [frontend/README.md](../../frontend/README.md)
-- [frontend/CODEBASE.md](../../frontend/CODEBASE.md)
-- [frontend/IMPLEMENTATION_SUMMARY.md](../../frontend/IMPLEMENTATION_SUMMARY.md)
-- [frontend/BACKEND_API_CHANGES.md](../../frontend/BACKEND_API_CHANGES.md)
-- [frontend/docs/MVP_REQUIREMENTS.md](../../frontend/docs/MVP_REQUIREMENTS.md)
-- [frontend/docs/OBSERVABILITY_IMPLEMENTATION.md](../../frontend/docs/OBSERVABILITY_IMPLEMENTATION.md)
-- [frontend/src/components/app/defaultCurateInstructions.md](../../frontend/src/components/app/defaultCurateInstructions.md)
-
-## Inferred high-level UX requirements
-
-### 1) Runtime configuration and local development
-
-- The frontend must support configuring the backend base URL, OpenAPI schema URL, and a dev-only user identifier via environment variables.
-- In local development, the frontend should call backend APIs under `/v1/...` and rely on a dev proxy to avoid CORS.
-- The UI should support a configurable default “self-serve assignment” limit.
-
-Evidence:
-- [frontend/README.md](../../frontend/README.md#L20-L44)
-
-> - `VITE_API_BASE_URL` – backend base URL …
-> - `VITE_OPENAPI_URL` – OpenAPI spec URL …
-> - `VITE_DEV_USER_ID` – optional dev-only user id sent as `X-User-Id`
-> - `VITE_SELF_SERVE_LIMIT` – optional default for self-serve assignments
-> … all requests to `/v1/...` are proxied to `VITE_API_BASE_URL` …
-
-### 2) App shape: single-page, multi-pane curation workspace
-
-- The app is a single-page experience (no router required by default) with a multi-pane curation workspace.
-- The primary workspace must separate concerns into:
-  - Left: queue of items
-  - Center: editor and actions
-  - Right: references (search vs selected)
-  - Additional views: stats, and other overlays/modals.
-
-Evidence:
-- [frontend/CODEBASE.md](../../frontend/CODEBASE.md#L70-L80)
-
-> “A single-page React app.”
-> “UX separation: Left queue, center editor … right references pane … stats view, and modal overlays.”
-
-### 3) Assignment-based workflows and queue navigation
-
-- The primary worklist must be “assigned items” (the curator’s current work queue).
-- The queue should:
-  - Display each item’s ID, status, version, and a truncated question.
-  - Support selecting an item to edit.
-  - Support refreshing/reloading the list.
-  - Highlight deleted items.
-- The UI should provide a “self-serve assignments” action in/near the queue to request more assigned work, using a configurable limit.
-
-Evidence:
-- [frontend/CODEBASE.md](../../frontend/CODEBASE.md#L128-L148)
-
-> “items: list shown in Queue, updated on save/refresh”
-> “viewMode: … ‘curate’ … ‘questions’ … ‘stats’”
-> “Self-serve assignments – Queue offers a button to request more assignments (limit via `VITE_SELF_SERVE_LIMIT`).”
-
-### 4) Editing flow (single-turn baseline)
-
-- The editor must allow updating question/answer content for the current item.
-- “Change category” is no longer required for saving; the UI should not block saving on that.
-
-Evidence:
-- [frontend/CODEBASE.md](../../frontend/CODEBASE.md#L86-L95)
-
-> “Change category: previously required when Q/A changed; no longer enforced.”
-
-### 5) References: search, add, select, visit/open, and annotate
-
-- The right panel must provide two distinct reference experiences:
-  - Search tab: search for candidate references and add them into the item.
-  - Selected tab: manage references already attached to the item.
-- Search UX requirements:
-  - Display search results and allow adding individual results.
-  - Support multi-select add.
-  - Prevent duplicate additions by URL (disable add when URL already present; de-dup by URL).
-- Selected references UX requirements:
-  - List attached references.
-  - Allow toggling which references are selected.
-  - Allow opening a reference (in a new tab) and marking it as visited.
-  - Allow capturing a “key paragraph” per selected reference and show a counter/length affordance.
-  - Allow removing a reference and undoing that removal within a time window.
-
-Evidence:
-- [frontend/CODEBASE.md](../../frontend/CODEBASE.md#L136-L144)
-
-> “ref opening: marks visited and opens in a new tab …”
-> “Search tab: … supports multi-select Add … disabled when URL already present; de-dup by URL.”
-> “Selected tab: … visit/open … key paragraph with counter; Remove supports Undo (8s window).”
-
-Additional evidence (curation guidance shown to users):
-- [frontend/src/components/app/defaultCurateInstructions.md](../../frontend/src/components/app/defaultCurateInstructions.md#L1-L4)
-
-> “Include references you actually visited; for selected ones, write a key paragraph (≥ 40 chars).”
-
-### 6) Approval gating and validation
-
-- The UI must gate “Approve” based on reference completeness:
-  - Requires at least one selected reference.
-  - If references exist, all references must be visited.
-  - Selected references must have a key paragraph of at least 40 characters.
-  - Deleted items cannot be approved.
-
-Evidence:
-- [frontend/CODEBASE.md](../../frontend/CODEBASE.md#L75-L79)
-
-> “Approval constraints: … at least one selected reference … all refs visited; selected refs have ≥40 char key paragraph. Deleted items cannot be approved.”
-
-### 7) Save semantics and user feedback
-
-- Save must be idempotent and detect “no-op” updates (avoid re-saving when nothing changed).
-- If there are no changes, the UI should communicate “No changes”.
-- Status-only updates should not be treated as content changes (no need to present them as version bumps in UX).
-
-Evidence:
-- [frontend/CODEBASE.md](../../frontend/CODEBASE.md#L140-L146)
-
-> “Save – computes state fingerprint; if unchanged: returns ‘No changes’.”
-
-### 8) Soft delete / restore workflows
-
-- Users must be able to soft-delete items and restore them.
-- Deleted items should visibly indicate deletion and be non-approvable.
-
-Evidence:
-- [frontend/CODEBASE.md](../../frontend/CODEBASE.md#L145-L147)
-
-> “Soft delete – … deleted items show a banner and cannot be approved; restore supported.”
-
-### 9) Export UX
-
-- Export should trigger a backend-driven snapshot download (JSON) rather than an in-app export modal.
-
-Evidence:
-- [frontend/CODEBASE.md](../../frontend/CODEBASE.md#L145-L146)
-
-> “Export – triggers backend snapshot download … no in-app JSON modal.”
-
-### 10) Tags: manage existing tags on an item
-
-- The UI must support applying tags to the current ground-truth item.
-- Tag creation is not required (and may not be supported by the backend); the UX should focus on selecting from a known set.
-- Tag validation may be constrained by a fixed schema.
-
-Evidence:
-- [frontend/docs/MVP_REQUIREMENTS.md](../../frontend/docs/MVP_REQUIREMENTS.md#L22-L27)
-
-> “get the known set of existing tags … (`GET /tags/schema`)”
-> “allow the user to create new tags (no write endpoints for tags)”
-> “apply the tags to the current ground truth …”
-> “tag validation … fixed schema”
-
-### 11) Curation instructions
-
-- The UI must surface curation instructions as user-consumable markdown.
-- Instructions are expected to be fetchable and writable per dataset (with concurrency control).
-
-Evidence:
-- [frontend/CODEBASE.md](../../frontend/CODEBASE.md#L86-L92)
-
-> `curationInstructions?: string`
-
-- [frontend/CODEBASE.md](../../frontend/CODEBASE.md#L165-L168)
-
-> “InstructionsPane … collapsible curation instructions surfaced per item”
-
-- [frontend/docs/MVP_REQUIREMENTS.md](../../frontend/docs/MVP_REQUIREMENTS.md#L15-L18)
-
-> “get curation instructions (`GET /datasets/{datasetName}/curation-instructions`)”
-> “write curation instructions (`PUT /datasets/{datasetName}/curation-instructions` with ETag)”
-
-### 12) Multi-turn curation (conversation history)
-
-- The UI must support multi-turn conversation editing in addition to classic single-turn Q/A.
-- It must provide:
-  - A timeline view of conversation turns.
-  - Adding/editing/deleting turns.
-  - An optional “context” field for application/product context.
-  - A mode toggle (single-turn vs multi-turn) with auto-detection and persistence.
-- Multi-turn approval adds requirements beyond single-turn:
-  - Must contain at least one user turn and one agent turn.
-  - All references must be marked with a relevance state.
-  - All “relevant” references must have key paragraphs ≥ 40 characters.
-
-Evidence:
-- [frontend/IMPLEMENTATION_SUMMARY.md](../../frontend/IMPLEMENTATION_SUMMARY.md#L88-L112)
-
-> “Mode Toggle … Auto-detection … Persistence: Saves preference to localStorage”
-> “Reference Relevance Tracking … Requires all references to be marked before approval …”
-> “Application Context … Collapsible Editor …”
-
-- [frontend/IMPLEMENTATION_SUMMARY.md](../../frontend/IMPLEMENTATION_SUMMARY.md#L147-L158)
-
-> “Multi-Turn Approval Requirements … All references marked … All ‘relevant’ references have key paragraphs ≥40 chars …”
-
-### 13) Keyboard shortcuts
-
-- The app should support global shortcuts for primary curation actions:
-  - Cmd/Ctrl+S: save draft.
-  - Cmd/Ctrl+Enter: attempt approve (still gated by validation).
-
-Evidence:
-- [frontend/CODEBASE.md](../../frontend/CODEBASE.md#L184-L184)
-
-> “Keyboard shortcuts: Cmd/Ctrl+S saves draft; Cmd/Ctrl+Enter attempts approve (gated)”
-
-### 14) Error handling and user feedback surfaces
-
-- The UI should provide toast-based feedback for:
-  - Network failures (and keep state consistent).
-  - Undo interactions (reference removal undo window).
-  - Browser popup blocking when opening references in new tabs.
-
-Evidence:
-- [frontend/CODEBASE.md](../../frontend/CODEBASE.md#L215-L233)
-
-> “Undo delete window: 8 seconds via toast action …”
-> “Network failures … show toast and keep state consistent”
-> “Popup blocked on new tab: info toast prompts user”
-
-### 15) Telemetry / observability (optional, safe-by-default)
-
-- Telemetry must be opt-in, safe, and no-op when disabled (including demo mode).
-- The UI should have a user-friendly error boundary for rendering failures, and log exceptions to telemetry when enabled.
-
-Evidence:
-- [frontend/docs/OBSERVABILITY_IMPLEMENTATION.md](../../frontend/docs/OBSERVABILITY_IMPLEMENTATION.md#L13-L18)
-
-> “Opt-in … Telemetry is disabled by default …”
-> “Safe: No-ops gracefully in demo mode or when configuration is missing”
-
-- [frontend/docs/OBSERVABILITY_IMPLEMENTATION.md](../../frontend/docs/OBSERVABILITY_IMPLEMENTATION.md#L79-L86)
-
-> “Error Boundary … Catches rendering errors … Renders a user-friendly fallback UI …”
-
-### 16) Demo mode
-
-- The UI must support a “demo mode” that toggles behavior at startup via environment variables.
-- Demo mode should disable telemetry and may use mock providers/services.
-
-Evidence:
-- [frontend/README.md](../../frontend/README.md#L74-L92)
-
-> “VITE_DEMO_MODE … to enable demo behavior”
-> “Telemetry automatically no-ops in demo mode …”
-
-## Noted doc drift / open questions (for follow-up)
-
-- Search + generation backend availability appears inconsistent across docs:
-  - [frontend/docs/MVP_REQUIREMENTS.md](../../frontend/docs/MVP_REQUIREMENTS.md#L28-L36) states no backend search/LLM endpoints.
-  - [frontend/CODEBASE.md](../../frontend/CODEBASE.md#L136-L145) describes search and generation flows calling backend (`searchReferences`, `callAgentChat`).
-  - This may be historical vs current behavior; confirm actual endpoints and desired UX when offline/backends are missing.
-
-- Export behavior differs by context:
-  - [frontend/CODEBASE.md](../../frontend/CODEBASE.md#L145-L146) states Export triggers snapshot download.
-  - Multi-turn export expansion is described as part of model/export logic in [frontend/IMPLEMENTATION_SUMMARY.md](../../frontend/IMPLEMENTATION_SUMMARY.md#L110-L112). Confirm whether export expansion is implemented in frontend, backend, or both.
diff --git a/.copilot-tracking/subagent/20260121/prd-requirements-research.md b/.copilot-tracking/subagent/20260121/prd-requirements-research.md
deleted file mode 100644
index 9eebea7..0000000
--- a/.copilot-tracking/subagent/20260121/prd-requirements-research.md
+++ /dev/null
@@ -1,176 +0,0 @@
-# PRD Requirements Research — High-level requirements consistent with current system
-
-Date: 2026-01-21
-
-## Scope and method
-
-This report extracts **high-level “shall/should/may” product requirements** from the PRD sources in this repo and then labels each requirement:
-
-- **Matches existing system**: Yes / No / Unclear
-- With a brief justification grounded in **current backend/frontend docs and code**.
-
-Primary requirement sources used:
-
-- [docs/ground-truth-curation-reqs.md](docs/ground-truth-curation-reqs.md) (MVP requirements)
-- [prd-genericize.json](prd-genericize.json) (genericization PRD)
-- [prd.json](prd.json) (backlog items / future requirements)
-
-Notes:
-
-- [ralph/ralph-prd.txt](ralph/ralph-prd.txt) appears to be **agent execution instructions**, not product requirements.
-- [BUSINESS_VALUE.md](BUSINESS_VALUE.md) is treated as **goals/KPIs**, not normative requirements.
-
----
-
-## Supported / consistent requirements (candidate “current PRD”)
-
-### R-001 — Product-agnostic configuration
-
-- Requirement (shall): The system shall be **product-agnostic**, removing hard-coded product/vendor branding and domain-specific content.
-- Requirement (shall): The system shall make **branding** configurable.
-- Requirement (shall): The system shall make **trusted reference domains** configurable.
-- Requirement (shall): The system shall support a **generic demo mode** (generic sample data).
-- Requirement (should): The system should make **manual tags** configurable.
-- Primary evidence: [prd-genericize.json](prd-genericize.json#L13-L18), [prd-genericize.json](prd-genericize.json#L39-L45)
-- Matches existing system: **Yes**
-- System evidence: [frontend/src/config/branding.ts](frontend/src/config/branding.ts#L11), [frontend/src/services/runtimeConfig.ts](frontend/src/services/runtimeConfig.ts#L49), [backend/app/main.py](backend/app/main.py#L44), [frontend/src/config/demo.ts](frontend/src/config/demo.ts#L2-L13)
-
-### R-002 — Bulk import ground-truth items
-
-- Requirement (shall): The system shall allow a curator/admin to **bulk import** generated ground-truth items via an API.
-- Requirement (should): The system should support importing **negative cases** via the same mechanism.
-- Primary evidence: [docs/ground-truth-curation-reqs.md](docs/ground-truth-curation-reqs.md#L142-L143)
-- Matches existing system: **Yes**
-- System evidence: [backend/CODEBASE.md](backend/CODEBASE.md#L103)
-
-### R-003 — Assignment visibility isolation
-
-- Requirement (shall): The system shall ensure an SME only sees **their assigned work** (and cannot access other SMEs’ assignments without override).
-- Primary evidence: [docs/ground-truth-curation-reqs.md](docs/ground-truth-curation-reqs.md#L137-L139)
-- Matches existing system: **Yes** (documented)
-- System evidence: [backend/CODEBASE.md](backend/CODEBASE.md#L111-L112)
-
-### R-004 — Self-serve assignment (pull model)
-
-- Requirement (shall): The system shall allow SMEs to **self-serve** (request) a limited number of items to work on.
-- Primary evidence: [docs/ground-truth-curation-reqs.md](docs/ground-truth-curation-reqs.md#L44-L44)
-- Matches existing system: **Yes**
-- System evidence: [backend/CODEBASE.md](backend/CODEBASE.md#L111)
-
-### R-005 — SME review actions (draft/save/approve/delete)
-
-- Requirement (shall): The system shall allow an SME to **edit and save**, **approve**, or **delete** an assigned item.
-- Primary evidence: [docs/ground-truth-curation-reqs.md](docs/ground-truth-curation-reqs.md#L147-L147)
-- Matches existing system: **Yes**
-- System evidence: [backend/app/api/v1/assignments.py](backend/app/api/v1/assignments.py#L29-L160)
-
-### R-006 — Snapshot & export of approved items
-
-- Requirement (shall): The system shall support a **weekly snapshot** and export an immutable JSON artifact containing **approved items + metadata**.
-- Primary evidence: [docs/ground-truth-curation-reqs.md](docs/ground-truth-curation-reqs.md#L193-L195)
-- Matches existing system: **Yes**
-- System evidence: [backend/CODEBASE.md](backend/CODEBASE.md#L108), [docs/ground-truth-curation-reqs.md](docs/ground-truth-curation-reqs.md#L352-L352)
-
-### R-007 — Controlled-vocabulary tagging (apply tags)
-
-- Requirement (shall): The system shall allow an SME to apply **multiple tags from a controlled list** to an item, and those tags shall be reflected in exports.
-- Primary evidence: [docs/ground-truth-curation-reqs.md](docs/ground-truth-curation-reqs.md#L167-L169)
-- Matches existing system: **Yes** (apply + schema retrieval)
-- System evidence: [backend/app/api/v1/tags.py](backend/app/api/v1/tags.py#L32), [backend/tests/integration/test_tags_schema_api.py](backend/tests/integration/test_tags_schema_api.py#L6), [backend/app/api/v1/assignments.py](backend/app/api/v1/assignments.py#L41-L46)
-
-### R-008 — Soft delete for ground-truth items
-
-- Requirement (shall): The system shall support **soft deletion** of items (hidden from default views/exports while retained for history), and allow deletion via the review workflow.
-- Primary evidence: [docs/ground-truth-curation-reqs.md](docs/ground-truth-curation-reqs.md#L173-L175)
-- Matches existing system: **Partial / Unclear** (soft delete exists; restore/cleanup requirements appear incomplete)
-- System evidence: [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L1253-L1254), [frontend/docs/MVP_REQUIREMENTS.md](frontend/docs/MVP_REQUIREMENTS.md#L16)
-
-### R-009 — Aggregate stats endpoint
-
-- Requirement (should): The system should provide a stats endpoint for progress/visibility.
-- Primary evidence: [docs/ground-truth-curation-reqs.md](docs/ground-truth-curation-reqs.md#L303-L303)
-- Matches existing system: **Yes (aggregate)**
-- System evidence: [backend/app/api/v1/stats.py](backend/app/api/v1/stats.py#L11-L14)
-
----
-
-## Out-of-scope / Not yet supported (per current system)
-
-These are high-level requirements present in PRD sources, but **do not currently match** what the repo documents/implements.
-
-### O-001 — LLM answer generation endpoint/workflow
-
-- Requirement (must/shall): SMEs shall be able to generate an answer using an LLM given the question + relevant context.
-- Primary evidence: [docs/ground-truth-curation-reqs.md](docs/ground-truth-curation-reqs.md#L150-L150)
-- Matches existing system: **No**
-- System evidence: [frontend/docs/MVP_REQUIREMENTS.md](frontend/docs/MVP_REQUIREMENTS.md#L33-L36)
-
-### O-002 — AI Search integration for attaching/detaching references
-
-- Requirement (must/shall): The UI shall connect to AI Search and allow SMEs to attach/detach relevant documents, persisting them into item metadata/exports.
-- Primary evidence: [docs/ground-truth-curation-reqs.md](docs/ground-truth-curation-reqs.md#L158-L161)
-- Matches existing system: **No**
-- System evidence: [frontend/docs/MVP_REQUIREMENTS.md](frontend/docs/MVP_REQUIREMENTS.md#L29-L31)
-
-### O-003 — Tag administration (manage controlled vocabulary)
-
-- Requirement (must/shall): Admins shall be able to manage the controlled tag list.
-- Primary evidence: [docs/ground-truth-curation-reqs.md](docs/ground-truth-curation-reqs.md#L167-L167)
-- Matches existing system: **No**
-- System evidence: [frontend/docs/MVP_REQUIREMENTS.md](frontend/docs/MVP_REQUIREMENTS.md#L24-L27)
-
-### O-004 — SME-specific stats
-
-- Requirement (must/shall): SMEs shall see statistics about *their assigned items* to track progress toward sprint goals.
-- Primary evidence: [docs/ground-truth-curation-reqs.md](docs/ground-truth-curation-reqs.md#L188-L188)
-- Matches existing system: **No** (the current stats endpoint is aggregate, not per-user)
-- System evidence: [frontend/docs/MVP_REQUIREMENTS.md](frontend/docs/MVP_REQUIREMENTS.md#L48-L48), [backend/app/api/v1/stats.py](backend/app/api/v1/stats.py#L11-L14)
-
-### O-005 — Batch as a first-class concept
-
-- Requirement (must/shall): Items shall be grouped into batches, with a single assignee per batch.
-- Primary evidence: [docs/ground-truth-curation-reqs.md](docs/ground-truth-curation-reqs.md#L182-L182)
-- Matches existing system: **Unclear** (assignments exist; “batch” entity support is not clearly implemented/documented)
-- System evidence: [backend/CODEBASE.md](backend/CODEBASE.md#L111-L112)
-
-### O-006 — Entra-based authentication / access control design + implementation
-
-- Requirement (should/shall): The system should support Entra-based access control (design and/or implementation stories are captured in PRD backlog).
-- Primary evidence: [prd.json](prd.json#L166-L173)
-- Matches existing system: **No** (explicitly documented as placeholder)
-- System evidence: [backend/CODEBASE.md](backend/CODEBASE.md#L120-L120)
-
-### O-007 — Keyword search of ground-truth items
-
-- Requirement (should/shall): The system should provide keyword search over question/answer for locating items.
-- Primary evidence: [prd.json](prd.json#L16-L16)
-- Matches existing system: **No**
-- System evidence: [frontend/docs/MVP_REQUIREMENTS.md](frontend/docs/MVP_REQUIREMENTS.md#L29-L31)
-
-### O-008 — PII detection in approval flow
-
-- Requirement (should/shall): The system should detect PII during (or before) approval to prevent sensitive data from entering the approved set.
-- Primary evidence: [prd.json](prd.json#L94-L95)
-- Matches existing system: **No**
-- System evidence: (no current backend/frontend evidence found in docs indicating PII scanning)
-
-### O-009 — Duplicate detection / prevention
-
-- Requirement (should/shall): The system should detect duplicates (draft vs approved) and prevent SMEs from working on duplicates.
-- Primary evidence: [prd.json](prd.json#L148-L155)
-- Matches existing system: **No**
-- System evidence: (no current backend/frontend evidence found in docs indicating duplicate detection)
-
-### O-010 — Chunking support
-
-- Requirement (should/shall): The system should support chunking (as described in backlog).
-- Primary evidence: [prd.json](prd.json#L40-L41)
-- Matches existing system: **No**
-- System evidence: (no current backend/frontend evidence found in docs indicating chunking support)
-
----
-
-## Quick takeaways
-
-- The **core curation loop** (bulk import → self-serve assignments → SME approve/edit/delete → export snapshot) is well supported and consistently documented.
-- “Stretch” requirements (AI Search attach/detach, LLM generation, RBAC/Entra, per-user stats, tag administration) are present in PRD sources but are **not yet supported** per current repo docs.
diff --git a/.copilot-tracking/subagent/20260121/subagent-reference-audit.md b/.copilot-tracking/subagent/20260121/subagent-reference-audit.md
deleted file mode 100644
index 92887f8..0000000
--- a/.copilot-tracking/subagent/20260121/subagent-reference-audit.md
+++ /dev/null
@@ -1,44 +0,0 @@
-<!-- markdownlint-disable-file -->
-# Subagent Reference Audit (20260121)
-
-## Purpose
-Verify whether all subagent research files in `.copilot-tracking/subagent/20260121/` are referenced by the top-level research document.
-
-## Inputs
-- Subagent folder: `.copilot-tracking/subagent/20260121/`
-- Top-level doc: `.copilot-tracking/research/20260121-high-level-requirements-research.md`
-- Match rule: extract markdown-style links (and raw text occurrences) that include the substring `.copilot-tracking/subagent/20260121/`.
-
-## Files Present In Subagent Folder
-- api-logic-research.md
-- backend-requirements-research.md
-- citation-validation.md
-- consolidated-requirements-synthesis.md
-- conventions-and-sources-research.md
-- conventions-research.md
-- cosmos-repo-research.md
-- frontend-requirements-research.md
-- prd-requirements-research.md
-- synthesis-notes.md
-
-## Files Referenced By Top-Level Doc
-(Links/mentions found in `.copilot-tracking/research/20260121-high-level-requirements-research.md` that reference `.copilot-tracking/subagent/20260121/`.)
-
-- prd-requirements-research.md
-
-## Present But Not Referenced
-- api-logic-research.md
-- backend-requirements-research.md
-- citation-validation.md
-- consolidated-requirements-synthesis.md
-- conventions-and-sources-research.md
-- conventions-research.md
-- cosmos-repo-research.md
-- frontend-requirements-research.md
-- synthesis-notes.md
-
-## Referenced But Missing
-- (none)
-
-## Notes
-- This audit only checks for references using the specific prefix `.copilot-tracking/subagent/20260121/`. If the top-level doc references these files via different relative paths (or without the folder prefix), they will not be counted here.
diff --git a/.copilot-tracking/subagent/20260121/synthesis-notes.md b/.copilot-tracking/subagent/20260121/synthesis-notes.md
deleted file mode 100644
index 507466e..0000000
--- a/.copilot-tracking/subagent/20260121/synthesis-notes.md
+++ /dev/null
@@ -1,143 +0,0 @@
----
-title: Synthesis — Refactor recommendations (API/service/repo boundaries + Cosmos emulator repo)
-description: Consolidated, line-cited recommendations based on prior research notes.
-author: GitHub Copilot (subagent)
-ms.date: 2026-01-21
-ms.topic: reference
----
-
-## 1) Consolidated responsibility boundary proposal (API vs service vs repo)
-
-### API layer (FastAPI routers)
-**Owns:** HTTP surface area only: authn/authz, request parsing, basic request-shape validation, and mapping typed service errors to HTTP responses.
-
-**Concrete examples (current violations):**
-- The SME update endpoint contains a full workflow: ownership enforcement, partial update semantics, approval/status transitions that clear assignment, history parsing (including embedded refs), ETag enforcement, computed tag application, persistence, and best-effort deletion of the assignment document — all inside the router handler in [backend/app/api/v1/assignments.py](backend/app/api/v1/assignments.py#L78-L232).
-- The general ground truth update endpoint repeats many of the same workflow concerns: status coercion, explicit business rules rejecting `computedTags` and legacy `tags`, history parsing, ETag enforcement, computed tag application, persistence, and then re-fetch for fresh ETag in [backend/app/api/v1/ground_truths.py](backend/app/api/v1/ground_truths.py#L232-L369).
-
-**Good existing pattern to emulate:**
-- Snapshot routes delegate domain work to `container.snapshot_service` and keep the handler thin in [backend/app/api/v1/ground_truths.py](backend/app/api/v1/ground_truths.py#L105-L154).
-
-### Service layer
-**Owns:** domain workflows/state transitions, cross-endpoint invariants, and shared parsing/normalization.
-
-**Recommended service boundaries (aligned to existing code):**
-- **`GroundTruthUpdateService` (new):** consolidate “update item” workflows used by both the SME update route and the general update route.
-  - Should own: partial update policy, history parsing, tag-field acceptance policy, computed tag recomputation policy, and ETag policy (requirement + mismatch translation).
-  - Justification: the routers currently duplicate logic and apply tags/ETags similarly in [backend/app/api/v1/assignments.py](backend/app/api/v1/assignments.py#L104-L198) and [backend/app/api/v1/ground_truths.py](backend/app/api/v1/ground_truths.py#L241-L363).
-- **`AssignmentService` (existing):** own assignment workflows; keep repo calls as persistence/atomic update primitives.
-  - Today `AssignmentService.self_assign` orchestrates retries and uses `repo.assign_to` + assignment-doc materialization in [backend/app/services/assignment_service.py](backend/app/services/assignment_service.py#L44-L146).
-  - Repo currently owns a large “sampling allocation + quota + selection + shuffle” algorithm in `sample_unassigned` and `_compute_quotas`, which is domain workflow rather than persistence and should move into the service layer ([backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L1409-L1609), [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L1649-L1680)).
-- **`GroundTruthDerivationsService` (new) OR domain model responsibility:** derived-field computation currently lives in the Cosmos adapter.
-  - The repo computes and mutates `totalReferences` during persistence (`_compute_total_references` and `_to_doc`) in [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L389-L443). This is a business definition ("history refs win") and should be owned above the storage adapter.
-
-### Repo layer (Cosmos adapters)
-**Owns:** persistence mechanics only: Cosmos client/container I/O, query construction, paging, concurrency primitives (ETag usage), and minimal storage-centric validations.
-
-**Concrete repo responsibilities (current examples):**
-- Interface surface is already formalized via `GroundTruthRepo` in [backend/app/adapters/repos/base.py](backend/app/adapters/repos/base.py#L1-L55).
-- Storage-centric query construction with safe parameterization belongs in the repo (e.g., tag and ref-url clauses, including emulator limitations) in [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L500-L590).
-- Cosmos pagination uses a direct SQL path with `ORDER BY` and a separate emulator path with in-memory filtering when needed in [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L660-L911).
-
-**What should move out of the repo:**
-- Domain validation of `user_id` currently happens inside `assign_to` (regex) in [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L1689-L1714). That rule is an API/service contract; repo should assume validated input.
-
-## 2) Recommended `cosmos_emulator.py` design (inherit vs wrapper) + override seams
-
-### Recommendation: subclass (inherit from) the production repo
-Create `CosmosEmulatorGroundTruthRepo(CosmosGroundTruthRepo)` in a new module `backend/app/adapters/repos/cosmos_emulator.py`.
-
-**Why inherit (vs wrapper) in this codebase:**
-- The container currently constructs a concrete `CosmosGroundTruthRepo` and wires services immediately afterward in [backend/app/container.py](backend/app/container.py#L83-L161). Keeping a compatible constructor minimizes DI churn.
-- Many emulator differences are already expressed as “same public method, different internal behavior” toggled by `is_cosmos_emulator_in_use()` in [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L644-L647). A subclass can make that decision structural (class-level) instead of conditional branches in production code.
-
-### Exact override seams (methods/properties) to isolate emulator behavior
-
-1) **`is_cosmos_emulator_in_use()`**
-- Base currently detects emulator via endpoint string in [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L644-L647).
-- Emulator subclass override: return `True` unconditionally.
-
-2) **`list_gt_paginated()` routing + emulator pagination path**
-- Base method conditionally routes to `_list_gt_paginated_with_emulator` when tags/ref_url are present and emulator is in use in [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L674-L707).
-- Emulator subclass override: simplify to always use `_list_gt_paginated_with_emulator` when `tags` or `ref_url` are provided, eliminating endpoint checks from production.
-- The emulator path explicitly disables SQL tag/ref_url filters and performs in-memory filtering due to emulator limitations in [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L806-L911).
-
-3) **Query filter construction for prod-only SQL features**
-- The `EXISTS(...)` ref-url filter is only injected when `include_ref_url=True` in [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L565-L585).
-- Emulator subclass should avoid ref-url SQL filters (continue doing in-memory filtering as implemented) by ensuring `include_ref_url=False` for emulator list operations.
-
-4) **`assign_to()` (patch vs read-modify-replace)**
-- Base currently:
-  - validates `user_id` with a regex
-  - chooses patch vs read-modify-replace based on emulator detection
-  in [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L1689-L1714).
-- Emulator subclass override:
-  - delegate validation to service (stop duplicating API contract here)
-  - always execute read-modify-replace (compatibility path) and avoid `patch_item` filter predicates.
-
-5) **Write-path normalization + emulator-specific retries (`upsert_gt`)**
-- Base uses `COSMOS_DISABLE_UNICODE_ESCAPE` gating and applies `_ensure_utf8_strings` before upsert/replace in [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L1099-L1117).
-- Base also includes emulator-specific retry behavior keyed off `is_cosmos_emulator_in_use()` and message matching for invalid JSON payload and intermittent jsonb errors in [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L1120-L1216).
-- Emulator subclass override: keep these retries (and optionally strengthen them), while production base can be simplified over time to rely on SDK retry policy.
-
-6) **Delete-path retries (`delete_dataset`)**
-- Base has emulator-only retry logic for intermittent errors and HTTP-format issues in [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L1235-L1360).
-- Emulator subclass override: keep retries local to emulator repo.
-
-### Consolidating the Unicode/backslash/base64 workaround into the emulator repo
-Right now the workaround is spread across:
-- Normalization + base64 helpers (`_normalize_unicode_for_cosmos`, `_restore_unicode_from_cosmos`) in [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L45-L176).
-- A repo-level wrapper `_ensure_utf8_strings()` in [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L361-L377).
-- Multiple call sites (import, curation upsert, GT upsert) that apply the wrapper in [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L448-L479) and [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L1079-L1117).
-- Read-path restore inside `_from_doc()` in [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L446-L459).
-
-**Recommendation:** define explicit “transform seams” in the base class and override them in the emulator subclass.
-- Base adds two protected methods:
-  - `_transform_doc_for_write(doc: dict[str, Any]) -> dict[str, Any]`
-  - `_transform_doc_for_read(doc: dict[str, Any]) -> dict[str, Any]`
-- Base default implementations are identity.
-- Emulator subclass overrides them to apply `_normalize_unicode_for_cosmos` / `_restore_unicode_from_cosmos` (and thus base64 encode/decode of `refs[*].content`). These behaviors already exist in the module and are gated by `settings.COSMOS_DISABLE_UNICODE_ESCAPE` in [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L99-L176).
-
-This turns today’s scattered per-method checks into a single, testable seam.
-
-## 3) Step-by-step migration plan (minimize risk, 9 steps)
-
-1) **Introduce typed domain exceptions for stable HTTP mapping**
-   - Replace the substring-based `ValueError` parsing in the assign endpoint with typed errors; the router currently maps error-message substrings to HTTP responses in [backend/app/api/v1/assignments.py](backend/app/api/v1/assignments.py#L255-L323).
-
-2) **Add `GroundTruthUpdateService` with a single “update workflow” entrypoint**
-   - Start by moving the shared logic (ETag requirement + mismatch mapping, history parsing, computed tags application) out of both routes.
-   - Current duplicated workflow lives in [backend/app/api/v1/assignments.py](backend/app/api/v1/assignments.py#L104-L198) and [backend/app/api/v1/ground_truths.py](backend/app/api/v1/ground_truths.py#L241-L363).
-
-3) **Switch routers to call the service (thin handlers)**
-   - Keep request parsing/validation in the handlers; move the workflow and repo calls into the service.
-   - Use the snapshot route pattern as precedent (service-first) in [backend/app/api/v1/ground_truths.py](backend/app/api/v1/ground_truths.py#L105-L154).
-
-4) **Extract parsing helpers into a shared module**
-   - Create reusable helpers for history parsing (including refs and expectedBehavior) since both handlers implement near-identical loops in [backend/app/api/v1/assignments.py](backend/app/api/v1/assignments.py#L152-L187) and [backend/app/api/v1/ground_truths.py](backend/app/api/v1/ground_truths.py#L300-L338).
-
-5) **Move assignment sampling logic out of the repo into service**
-   - Shift the allocation/quota/selection algorithm from `CosmosGroundTruthRepo.sample_unassigned` to `AssignmentService` (or a dedicated `AssignmentSamplingService`).
-   - Current algorithm is in [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L1409-L1609), with quota math in [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L1649-L1680).
-
-6) **Move derived-field computation (`totalReferences`) out of the repo**
-   - Stop mutating `GroundTruthItem.totalReferences` inside `_to_doc` and compute it in a derivations service before persistence.
-   - Current mutation happens in [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L410-L443).
-
-7) **Introduce `CosmosEmulatorGroundTruthRepo` and select it in the container**
-   - Container already derives an emulator/non-TLS condition via `USE_COSMOS_EMULATOR` and endpoint scheme in [backend/app/container.py](backend/app/container.py#L110-L119).
-   - Add a class selection branch there (keep constructor signature compatible).
-   - Emulator flag is defined in settings in [backend/app/core/config.py](backend/app/core/config.py#L28-L45).
-
-8) **Centralize the document transform seam**
-   - Implement `_transform_doc_for_write/_transform_doc_for_read` and route existing `_ensure_utf8_strings` usage through it.
-   - Grounding: normalization functions and wrapper already exist in [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L45-L176) and [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L361-L377).
-
-9) **Update tests to target the new seams (keep behavior identical first)**
-   - Keep production behavior unchanged; emulator behavior should remain behind `USE_COSMOS_EMULATOR` or localhost endpoint detection initially.
-
-## 4) Alternatives considered (brief)
-
-- **Flags-in-repo (status quo):** simplest, but keeps production and emulator concerns entangled (e.g., emulator routing in `list_gt_paginated`) in [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L674-L707).
-- **Subclass (recommended):** isolates emulator-only behavior while keeping constructor + protocol stable (container wiring remains straightforward) in [backend/app/container.py](backend/app/container.py#L83-L161).
-- **Strategy object / wrapper:** the purest separation (inject a “capabilities/transforms” strategy), but higher churn because many internal calls and helper methods cannot be intercepted without adding new seams.
diff --git a/.copilot-tracking/subagent/20260122/architecture-refactoring-research.md b/.copilot-tracking/subagent/20260122/architecture-refactoring-research.md
deleted file mode 100644
index 56bcc29..0000000
--- a/.copilot-tracking/subagent/20260122/architecture-refactoring-research.md
+++ /dev/null
@@ -1,321 +0,0 @@
-# Architecture Refactoring Research
-
-**Date:** 2026-01-22
-**Stories:** SA-746 (Refactor API logic into services), SA-424 (Refactor cosmos_repo.py)
-
-## Executive Summary
-
-The backend has significant duplicate logic between `assignments.py` and `ground_truths.py` API endpoints. The `cosmos_repo.py` file is 1,500+ lines and contains emulator-specific workarounds and business logic that should be extracted. The existing service layer pattern (`AssignmentService`, `CurationService`, etc.) provides a clear blueprint for refactoring.
-
----
-
-## 1. Current API Endpoint Structure
-
-### Assignments API (`/v1/assignments/`)
-
-| Endpoint | Method | Purpose | Frontend Usage |
-|----------|--------|---------|----------------|
-| `/self-serve` | POST | Bulk self-assignment | Yes - requestAssignmentsSelfServe |
-| `/my` | GET | List user's assignments | Yes - getMyAssignments |
-| `/{dataset}/{bucket}/{item_id}` | PUT | Update assigned item | Yes - updateAssignedGroundTruth |
-| `/{dataset}/{bucket}/{item_id}/assign` | POST | Assign single item | Yes - assignItem |
-| `/{dataset}/{bucket}/{item_id}/duplicate` | POST | Duplicate as rephrase | Yes - duplicateItem |
-
-### Ground Truths API (`/v1/ground-truths/`)
-
-| Endpoint | Method | Purpose | Frontend Usage |
-|----------|--------|---------|----------------|
-| (collection root) | POST | Bulk import | No (admin) |
-| (collection root) | GET | List all (paginated) | Yes - listAllGroundTruths (Explorer) |
-| `/snapshot` | POST/GET | Export snapshot | Yes - downloadSnapshot |
-| `/{datasetName}` | GET | List by dataset | Unknown |
-| `/{datasetName}/{bucket}/{item_id}` | GET | Get single item | Yes - getGroundTruth |
-| `/{datasetName}/{bucket}/{item_id}` | PUT | Update item | Yes - restoreGroundTruth |
-| `/{datasetName}/{bucket}/{item_id}` | DELETE | Soft delete | Yes - deleteGroundTruth |
-| `/recompute-tags` | POST | Bulk tag recomputation | No (admin) |
-
----
-
-## 2. Duplicate Logic Analysis
-
-### 2.1 Item Update Logic (HIGH PRIORITY)
-
-Both `assignments.py:update_item()` and `ground_truths.py:update_ground_truth()` contain nearly identical logic:
-
-**Shared patterns (~80% overlap):**
-
-```python
-# Both endpoints do:
-# 1. Fetch item via container.repo.get_gt()
-# 2. Apply field updates (edited_question, answer, comment, status, refs, manual_tags)
-# 3. Handle history field parsing (identical HistoryItem conversion)
-# 4. Handle ETag validation (If-Match header or body.etag)
-# 5. Apply computed tags via apply_computed_tags()
-# 6. Persist via container.repo.upsert_gt()
-# 7. Re-fetch and return updated item
-```
-
-**Differences:**
-
-| Aspect | Assignments | Ground Truths |
-|--------|------------|---------------|
-| Authorization | `assignedTo == user` check | No assignment check |
-| Status handling | Clears assignment on approve/delete | No assignment clearing |
-| Payload model | `AssignmentUpdateRequest` (Pydantic) | `dict[str, Any]` (raw) |
-| Assignment doc cleanup | Yes (deletes assignment doc) | No |
-| `approve` flag | Convenience boolean | Not supported |
-
-### 2.2 History Parsing (MEDIUM PRIORITY)
-
-Identical history parsing code in both endpoints (~30 lines each):
-
-```python
-# Duplicated in assignments.py:140-160 and ground_truths.py:280-305
-history_items = []
-for h in payload.history:
-    refs_data = h.get("refs")
-    refs_list = None
-    if refs_data is not None:
-        refs_list = [r if isinstance(r, Reference) else Reference(**r) for r in refs_data]
-    expected_behavior_data = h.get("expected_behavior") or h.get("expectedBehavior")
-    history_items.append(HistoryItem(
-        role=h["role"],
-        msg=h.get("msg") or h.get("content", ""),
-        refs=refs_list,
-        expected_behavior=expected_behavior_data if isinstance(expected_behavior_data, list) else None,
-    ))
-it.history = history_items
-```
-
-### 2.3 Tag Handling (LOW PRIORITY)
-
-Both endpoints validate and set `manual_tags` with identical patterns:
-
-```python
-if "manual_tags" in provided_fields:  # or "manualTags" in payload
-    try:
-        it.manual_tags = payload.manual_tags or []
-    except ValueError as e:
-        raise HTTPException(status_code=400, detail=str(e))
-```
-
----
-
-## 3. cosmos_repo.py Analysis
-
-### 3.1 File Statistics
-
-- **Total lines:** 1,536
-- **Functions/methods:** 35+
-- **Contains:** Cosmos emulator workarounds, Unicode sanitization, business logic
-
-### 3.2 Logical Components (Candidates for Extraction)
-
-| Component | Lines | Description | Extract To |
-|-----------|-------|-------------|------------|
-| Unicode sanitization | 50-150 | `_sanitize_string_for_cosmos`, `_normalize_unicode_for_cosmos`, `_restore_unicode_from_cosmos` | `cosmos_emulator.py` or `unicode_utils.py` |
-| Base64 encoding for refs | 151-200 | `_base64_encode_refs_content`, `_base64_decode_refs_content` | `cosmos_emulator.py` |
-| Sort security validation | 600-650 | `SortSecurityError`, `_build_secure_sort_clause` | Keep in repo (security) |
-| Quota computation | 1100-1150 | `_compute_quotas` (largest remainder method) | `AssignmentService` |
-| Query building | 500-600 | `_build_query_filter` | Keep in repo (query concern) |
-| Document conversion | 350-450 | `_to_doc`, `_from_doc`, `_to_curation_doc`, `_from_curation_doc` | Keep in repo |
-
-### 3.3 Business Logic in Repository (Should Move to Service)
-
-1. **`sample_unassigned()`** (lines 1000-1150)
-   - Contains allocation/weighting logic
-   - Calls `_compute_quotas()` (policy decision)
-   - Should be: Service orchestrates, repo just queries
-
-2. **`assign_to()`** (lines 1200-1350)
-   - Contains conditional assignment logic
-   - User validation regex check (security concern - keep in service)
-   - Different code paths for emulator vs production
-
-3. **Total reference calculation** (lines 380-390)
-   - `_compute_total_references()` is business logic
-   - Currently in `_to_doc()` - should move to domain model or service
-
-### 3.4 Emulator-Specific Code
-
-The following are emulator workarounds that could be isolated:
-
-```python
-# Pattern: is_cosmos_emulator_in_use() checks
-def is_cosmos_emulator_in_use(self) -> bool:
-    return "localhost" in self._endpoint or "127.0.0.1" in self._endpoint
-
-# Used in:
-# - list_gt_paginated() - routes to _list_gt_paginated_with_emulator()
-# - _get_filtered_count() - different counting strategy
-# - assign_to() - read-modify-replace vs patch
-# - upsert_gt() - retry logic for jsonb errors
-# - delete_dataset() - retry logic
-```
-
----
-
-## 4. Current Service Layer Structure
-
-### 4.1 Existing Services
-
-| Service | Location | Responsibility |
-|---------|----------|----------------|
-| `AssignmentService` | services/assignment_service.py | Self-assign, assign single, duplicate |
-| `CurationService` | services/curation_service.py | Dataset curation instructions |
-| `SnapshotService` | services/snapshot_service.py | Export snapshots |
-| `TaggingService` | services/tagging_service.py | Tag validation, computed tags |
-| `ValidationService` | services/validation_service.py | Bulk import validation |
-| `SearchService` | services/search_service.py | Azure AI Search adapter |
-| `TagRegistryService` | services/tag_registry_service.py | Tag registry management |
-| `ChatService` | services/chat_service.py | AI chat functionality |
-
-### 4.2 Service Pattern Used
-
-```python
-class AssignmentService:
-    def __init__(self, repo: GroundTruthRepo):
-        self.repo = repo
-    
-    async def self_assign(self, user_id: str, limit: int) -> list[GroundTruthItem]:
-        # Orchestrates repo calls
-        # Contains business logic (retry, shuffle, validation)
-        pass
-```
-
-### 4.3 Container Wiring
-
-```python
-# container.py
-self.assignment_service = AssignmentService(self.repo)
-self.curation_service = CurationService(self.repo)
-self.snapshot_service = SnapshotService(self.repo, ...)
-```
-
----
-
-## 5. Refactoring Recommendations
-
-### 5.1 Phase 1: Extract Update Logic to Service (SA-746)
-
-Create `GroundTruthService` with shared update logic:
-
-```python
-# services/ground_truth_service.py
-class GroundTruthService:
-    def __init__(self, repo: GroundTruthRepo):
-        self.repo = repo
-    
-    async def update_item(
-        self,
-        dataset: str,
-        bucket: UUID,
-        item_id: str,
-        updates: ItemUpdateDTO,
-        user_id: str | None,
-        etag: str | None,
-        *,
-        enforce_assignment: bool = False,
-        clear_assignment_on_complete: bool = False,
-    ) -> GroundTruthItem:
-        """Unified item update logic."""
-        pass
-    
-    def parse_history(self, raw_history: list[dict]) -> list[HistoryItem]:
-        """Parse history from API payload."""
-        pass
-```
-
-### 5.2 Phase 2: Split cosmos_repo.py (SA-424)
-
-**File structure:**
-
-```
-backend/app/adapters/repos/
-├── base.py                    # Protocol (unchanged)
-├── cosmos_repo.py             # Core repo (~800 lines)
-├── cosmos_emulator.py         # Emulator workarounds (~200 lines)
-├── cosmos_unicode.py          # Unicode sanitization (~100 lines)
-└── tags_repo.py               # Tags (unchanged)
-```
-
-**Extract to cosmos_emulator.py:**
-
-- `_base64_encode_refs_content()`
-- `_base64_decode_refs_content()`
-- `_sanitize_string_for_cosmos()`
-- `_normalize_unicode_for_cosmos()`
-- `_restore_unicode_from_cosmos()`
-- `_list_gt_paginated_with_emulator()` (as standalone function)
-- `_assign_to_with_read_modify_replace()` (as standalone function)
-
-**Move to service layer:**
-
-- `_compute_quotas()` → `AssignmentService`
-- `_compute_total_references()` → Domain model (`GroundTruthItem.total_references` property)
-
-### 5.3 Phase 3: Consolidate API Endpoints (Optional)
-
-Consider making `assignments` endpoint a thin wrapper that:
-
-1. Validates assignment ownership
-2. Calls `GroundTruthService.update_item()` with `enforce_assignment=True`
-3. Handles assignment document cleanup
-
----
-
-## 6. Frontend Impact Assessment
-
-### Assignments Endpoints (All Used by Frontend)
-
-- `POST /self-serve` - Used for initial assignment
-- `GET /my` - Used for loading assigned items
-- `PUT /{...}` - Used for all SME edits
-- `POST /{...}/assign` - Used for explicit item assignment
-- `POST /{...}/duplicate` - Used for rephrase creation
-
-### Ground Truths Endpoints
-
-- `GET /` (paginated) - Used by Explorer view
-- `GET /{...}` - Used for item detail fetch
-- `PUT /{...}` - Used for restore from deleted
-- `DELETE /{...}` - Used for soft delete
-- `GET /snapshot` - Used for export download
-
-**Conclusion:** Both endpoint groups are actively used. Refactoring must preserve API contracts.
-
----
-
-## 7. Risk Assessment
-
-| Risk | Likelihood | Impact | Mitigation |
-|------|------------|--------|------------|
-| Breaking API contract | Low | High | Keep endpoint signatures identical |
-| ETag behavior changes | Medium | High | Comprehensive integration tests |
-| Emulator-specific regressions | Medium | Medium | Run test suite with emulator flag |
-| Service layer adds latency | Low | Low | Profile before/after |
-
----
-
-## 8. Next Steps
-
-1. **Create spec** for `GroundTruthService` with unified update logic
-2. **Define interface** for emulator compatibility layer
-3. **Estimate effort** for each phase
-4. **Prioritize** based on Jira story scope
-
----
-
-## Appendix: File Line Counts
-
-```
-backend/app/api/v1/assignments.py     - 242 lines
-backend/app/api/v1/ground_truths.py   - 405 lines
-backend/app/adapters/repos/cosmos_repo.py - 1,536 lines
-backend/app/adapters/repos/base.py    - 57 lines
-backend/app/services/assignment_service.py - 210 lines
-backend/app/services/curation_service.py - 35 lines
-backend/app/services/tagging_service.py - 130 lines
-backend/app/services/validation_service.py - 70 lines
-backend/app/services/snapshot_service.py - 90 lines
-```
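Note on section 2.2 of the deleted architecture-refactoring-research.md above: the history parsing duplicated in both endpoints could be pulled into a single helper, which is what section 5.1 proposes with `parse_history`. The following is a minimal sketch under stated assumptions, not the project's implementation. `Reference` and `HistoryItem` are simplified stand-ins whose fields are assumed, and the function is written at module level rather than as a `GroundTruthService` method.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Reference:
    """Stand-in for the project's Reference model (fields are assumptions)."""
    url: str
    content: Optional[str] = None


@dataclass
class HistoryItem:
    """Stand-in for the project's HistoryItem model (fields are assumptions)."""
    role: str
    msg: str
    refs: Optional[list] = None
    expected_behavior: Optional[list] = None


def parse_history(raw_history: list[dict[str, Any]]) -> list[HistoryItem]:
    """Convert raw history dicts from an API payload into HistoryItem objects."""
    items: list[HistoryItem] = []
    for h in raw_history:
        refs_data = h.get("refs")
        refs = None
        if refs_data is not None:
            # Accept both already-built Reference objects and plain dicts.
            refs = [r if isinstance(r, Reference) else Reference(**r) for r in refs_data]
        # Accept snake_case and camelCase keys, as the duplicated code does.
        expected = h.get("expected_behavior") or h.get("expectedBehavior")
        items.append(
            HistoryItem(
                role=h["role"],
                msg=h.get("msg") or h.get("content", ""),
                refs=refs,
                expected_behavior=expected if isinstance(expected, list) else None,
            )
        )
    return items
```

Both endpoints could then call `it.history = parse_history(payload.history)`, which removes the roughly 30 duplicated lines from each handler without touching the API contract.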
diff --git a/.copilot-tracking/subagent/20260122/assignment-error-feedback-research.md b/.copilot-tracking/subagent/20260122/assignment-error-feedback-research.md
deleted file mode 100644
index 8a5ba58..0000000
--- a/.copilot-tracking/subagent/20260122/assignment-error-feedback-research.md
+++ /dev/null
@@ -1,233 +0,0 @@
-# Assignment Error Feedback Research
-
-**Date:** 2026-01-22
-**Topic:** assignment-error-feedback
-**Status:** Complete
-
-## Executive Summary
-
-The assignment error feedback system is only partially implemented. The backend returns appropriate status codes (409 for conflicts), but only with a plain `detail` string that carries no error code or assignee. The frontend does not surface that string and shows a generic "Failed to assign item" error instead. The toast notification system is in place and supports actionable buttons, but it is not used for assignment conflicts.
-
----
-
-## Research Findings
-
-### 1. Backend Response Structure for "Already Assigned" Failure
-
-**Location:** [backend/app/api/v1/assignments.py](backend/app/api/v1/assignments.py#L258-L300)
-
-The `assign_item` endpoint handles assignment errors:
-
-```python
-@router.post("/{dataset}/{bucket}/{item_id}/assign", status_code=200)
-async def assign_item(...) -> GroundTruthItem:
-    try:
-        assigned = await container.assignment_service.assign_single_item(...)
-        return assigned
-    except ValueError as e:
-        error_msg = str(e)
-        if "already assigned" in error_msg.lower():
-            raise HTTPException(
-                status_code=409,
-                detail="This item is already assigned to another user.",
-            )
-```
-
-**Service Layer:** [backend/app/services/assignment_service.py](backend/app/services/assignment_service.py#L196-L207)
-
-```python
-if (
-    item.assignedTo
-    and item.assignedTo != user_id
-    and item.status == GroundTruthStatus.draft
-):
-    raise ValueError("Item is already assigned to another user")
-```
-
-### 2. Status Codes and Error Payload
-
-| Scenario | Status Code | Detail Message |
-|----------|-------------|----------------|
-| Item already assigned to another user (draft) | **409 Conflict** | `"This item is already assigned to another user."` |
-| Item not found | **404 Not Found** | `"The requested item could not be found or has been deleted."` |
-| Other validation failures | **400 Bad Request** | `"Unable to assign this item. Please check the item status and try again."` |
-
-**Current Payload Structure:**
-```json
-{
-  "detail": "This item is already assigned to another user."
-}
-```
-
-**Gap Identified:** The payload does NOT include:
-- Error code (e.g., `ASSIGNMENT_CONFLICT`)
-- Current assignee identity (`assignedTo`)
-- Structured error object
-
-The PRD (SA-825) explicitly requires:
-> "Backend returns a specific status code (e.g., 409 Conflict) and a structured error payload (e.g., code + assignedTo) so the frontend can render the correct UX."
-
-### 3. Frontend Error Handling for Assignments
-
-**Location:** [frontend/src/demo.tsx](frontend/src/demo.tsx#L184-L213)
-
-```tsx
-onAssign={async (item) => {
-  try {
-    await assignItem(item.datasetName, item.bucket, item.id);
-    toast("success", `Assigned ${item.id} for curation`);
-  } catch (error) {
-    const message =
-      error instanceof Error
-        ? error.message
-        : "Failed to assign item";
-    toast("error", message);
-  }
-}}
-```
-
-**Service Layer:** [frontend/src/services/assignments.ts](frontend/src/services/assignments.ts#L64-L76)
-
-```typescript
-export async function assignItem(
-  dataset: string,
-  bucket: string,
-  itemId: string,
-): Promise<GroundTruthItemOut> {
-  const { data, error } = await client.POST(
-    "/v1/assignments/{dataset}/{bucket}/{item_id}/assign",
-    { params: { path: { dataset, bucket, item_id: itemId } } },
-  );
-  if (error) throw error;
-  return data as unknown as GroundTruthItemOut;
-}
-```
-
-**Gap Identified:** The frontend:
-1. Throws the raw error object from `openapi-fetch` (the parsed response body, not an `Error` instance)
-2. Reads `error.message` only when the thrown value is an `Error`, so the backend's `detail` is never read
-3. Does NOT check status codes or parse structured error responses
-4. Falls back to the generic "Failed to assign item" message
-
-### 4. Toast/Notification System
-
-**Location:** [frontend/src/hooks/useToasts.ts](frontend/src/hooks/useToasts.ts)
-
-The toast system supports:
-- **Types:** `success`, `error`, `info`
-- **Actionable buttons:** `actionLabel` and `onAction` callback
-- **Auto-dismiss:** Configurable duration (default 3500ms)
-
-```typescript
-export type Toast = {
-  id: string;
-  kind: "success" | "error" | "info";
-  msg: string;
-  actionLabel?: string;  // ← Supports action buttons
-  onAction?: () => void; // ← Callback for action
-};
-
-const showToast = useCallback(
-  (kind: Toast["kind"], msg: string, opts?: ShowOptions) => { ... },
-  [dismiss],
-);
-```
-
-**Toast Component:** [frontend/src/components/common/Toasts.tsx](frontend/src/components/common/Toasts.tsx)
-
-The UI renders action buttons when provided:
-```tsx
-{t.actionLabel && t.onAction && (
-  <button onClick={() => onActionClick?.(t.id, t.onAction)}>
-    {t.actionLabel}
-  </button>
-)}
-```
-
-### 5. Assignment Logic Locations
-
-#### Backend
-
-| Component | File | Purpose |
-|-----------|------|---------|
-| API Route | [backend/app/api/v1/assignments.py](backend/app/api/v1/assignments.py#L258) | `POST /v1/assignments/{dataset}/{bucket}/{item_id}/assign` |
-| Service | [backend/app/services/assignment_service.py](backend/app/services/assignment_service.py#L175) | `assign_single_item()` - validation & orchestration |
-| Repository | [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L1859) | `assign_to()` - database operations |
-| Error Classes | [backend/app/core/errors.py](backend/app/core/errors.py) | `ConflictError(HTTPException)` - not currently used for assignments |
-
-#### Frontend
-
-| Component | File | Purpose |
-|-----------|------|---------|
-| Service | [frontend/src/services/assignments.ts](frontend/src/services/assignments.ts#L64) | `assignItem()` - API call |
-| Main App | [frontend/src/demo.tsx](frontend/src/demo.tsx#L184) | `onAssign` handler with error display |
-| Toast Hook | [frontend/src/hooks/useToasts.ts](frontend/src/hooks/useToasts.ts) | Toast state management |
-| Toast UI | [frontend/src/components/common/Toasts.tsx](frontend/src/components/common/Toasts.tsx) | Toast rendering |
-
----
-
-## Gap Analysis
-
-| Requirement (SA-825) | Current State | Gap |
-|---------------------|---------------|-----|
-| Clear, specific error message in UI | Generic "Failed to assign item" | ❌ Backend message not surfaced |
-| Toast includes action to view assignee | No action button shown | ❌ Not implemented |
-| Assignee identity surfaced | Not included in error response | ❌ Backend doesn't return `assignedTo` |
-| Structured error payload with code | Plain `detail` string only | ❌ No error code or assignee field |
-
----
-
-## Recommendations
-
-### Backend Changes
-
-1. **Enhance error response structure** in [assignments.py](backend/app/api/v1/assignments.py#L290-L295):
-   ```python
-   raise HTTPException(
-       status_code=409,
-       detail={
-           "code": "ASSIGNMENT_CONFLICT",
-           "message": "This item is already assigned to another user.",
-           "assignedTo": item.assignedTo  # Include current assignee
-       }
-   )
-   ```
-
-2. **Update OpenAPI spec** to document 409 response schema with error structure.
-
-### Frontend Changes
-
-1. **Parse error responses** in [assignments.ts](frontend/src/services/assignments.ts#L64-L76) to extract status code and detail:
-   ```typescript
-   if (response.status === 409) {  // destructure `response` alongside `data` and `error`
-     throw new AssignmentConflictError(error.detail);
-   }
-   ```
-
-2. **Show specific toast with action** in [demo.tsx](frontend/src/demo.tsx#L206-L211):
-   ```typescript
-   toast("error", `Assigned to ${assignee}`, {
-     actionLabel: "View",
-     onAction: () => showAssigneeProfile(assignee)
-   });
-   ```
-
----
-
-## Related Documentation
-
-- [backend/docs/assign-single-item-endpoint.md](backend/docs/assign-single-item-endpoint.md) - Endpoint specification
-- [backend/docs/api-change-checklist-assignments.md](backend/docs/api-change-checklist-assignments.md) - API change guidelines
-- [prd-refined-2.json](prd-refined-2.json) - SA-825 requirements
-
-## Test Coverage
-
-Existing integration test: [backend/tests/integration/test_assignments_assign_single_cosmos.py](backend/tests/integration/test_assignments_assign_single_cosmos.py#L77-L93)
-
-```python
-async def test_assign_single_item_already_assigned(...):
-    """Test assigning an item already assigned to another user returns 409."""
-    # Verifies 409 status code for conflict scenario
-    r = await async_client.post(f"/v1/assignments/{ds}/{bucket}/{item_id}/assign", ...)
-    assert r.status_code == 409
-```
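Note on the backend recommendation in the deleted assignment-error-feedback-research.md above: the proposed 409 payload reads `item.assignedTo`, but the route's `except ValueError` handler only sees the exception, not the item. One way to bridge that gap is to let the service raise an error that carries the assignee. The sketch below is an assumption-laden illustration, not the project's code. `AssignmentConflictError` and `conflict_detail` are hypothetical names, and the sketch uses only the standard library so that it stays independent of the web framework.

```python
from typing import Optional


class AssignmentConflictError(ValueError):
    """Hypothetical domain error that carries the current assignee.

    Subclassing ValueError keeps existing `except ValueError` handlers working.
    """

    def __init__(self, assigned_to: Optional[str]):
        super().__init__("Item is already assigned to another user")
        self.assigned_to = assigned_to


def conflict_detail(err: AssignmentConflictError) -> dict:
    """Build the structured body that SA-825 asks for (code + assignedTo)."""
    return {
        "code": "ASSIGNMENT_CONFLICT",
        "message": "This item is already assigned to another user.",
        "assignedTo": err.assigned_to,
    }
```

In the route, the handler could catch `AssignmentConflictError` before the generic `ValueError` branch and pass `conflict_detail(e)` as the `detail` of the 409 `HTTPException`. Whether exposing the assignee identity is acceptable is a product decision that the research file does not settle.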
diff --git a/.copilot-tracking/subagent/20260122/assignment-takeover-research.md b/.copilot-tracking/subagent/20260122/assignment-takeover-research.md
deleted file mode 100644
index 866a30a..0000000
--- a/.copilot-tracking/subagent/20260122/assignment-takeover-research.md
+++ /dev/null
@@ -1,262 +0,0 @@
-# Assignment Takeover Research
-
-**Date:** 2026-01-22  
-**Topic:** Assignment takeover system - allowing SMEs to reassign items currently assigned to others  
-**Issue Reference:** SA-721
-
----
-
-## Executive Summary
-
-The current system **blocks** assignment of draft items that belong to another user (409 Conflict). There is **no existing force/takeover logic** in the codebase. The backend has a clear validation checkpoint that could be modified to accept a `force` parameter. The frontend uses `window.confirm()` for confirmation dialogs throughout the codebase.
-
----
-
-## 1. Current Assignment Flow and Data Model
-
-### Assignment Data Model
-
-**GroundTruthItem** (in [backend/app/domain/models.py](backend/app/domain/models.py)):
-```python
-assignedTo: Optional[str] = Field(default=None, alias="assignedTo")
-assigned_at: Optional[datetime] = Field(default=None, alias="assignedAt")
-```
-
-**AssignmentDocument** (materialized view for fast per-user queries):
-```python
-class AssignmentDocument(BaseModel):
-    id: str  # stable id: "<dataset>|<bucket>|<groundTruthId>"
-    pk: str  # SME user id (partition key)
-    ground_truth_id: str
-    datasetName: str
-    bucket: UUID
-    docType: str = "sme-assignment"
-    schemaVersion: str = "v1"
-```
-
-### Assignment Flow
-
-1. **Self-serve assignment** (`POST /v1/assignments/self-serve`):
-   - Samples unassigned items from the pool
-   - Assigns batch to requesting user
-   - Creates `AssignmentDocument` for each item
-
-2. **Single-item assignment** (`POST /v1/assignments/{dataset}/{bucket}/{item_id}/assign`):
-   - User explicitly selects an item to work on
-   - Validates assignability (see conflict handling below)
-   - Sets `assignedTo`, `assignedAt`, `status=draft`
-   - Creates/updates `AssignmentDocument`
-
----
-
-## 2. Backend Conflict Handling
-
-### Current Validation Logic
-
-Location: [backend/app/services/assignment_service.py#L199-L210](backend/app/services/assignment_service.py#L199-L210)
-
-```python
-# Validate item can be assigned
-# Don't allow assignment of items already assigned to another user in draft state
-if (
-    item.assignedTo
-    and item.assignedTo != user_id
-    and item.status == GroundTruthStatus.draft
-):
-    logger.warning(
-        f"assignment_service.assign_single_item.already_assigned - ..."
-    )
-    raise ValueError("Item is already assigned to another user")
-```
-
-### Assignment Rules (from [backend/docs/assign-single-item-endpoint.md](backend/docs/assign-single-item-endpoint.md)):
-
-| Scenario | Current Behavior |
-|----------|------------------|
-| Unassigned draft items | Can be assigned ✅ |
-| Items assigned to another user (draft) | **Cannot be assigned (409 Conflict)** ❌ |
-| Skipped items | Can be reassigned ✅ |
-| Approved items | Can be assigned (moves to draft) ✅ |
-| Deleted items | Can be assigned (moves to draft) ✅ |
-
-### Repository Layer
-
-Location: [backend/app/adapters/repos/cosmos_repo.py#L1719](backend/app/adapters/repos/cosmos_repo.py#L1719)
-
-The `assign_to()` method contains no explicit state validation of its own; that check lives in the **service layer**, not the repository. The write is not fully unconditional, however, because the patch operation carries a filter predicate.
-
-Filter predicate in Cosmos patch operation:
-```python
-filter_predicate = (
-    f"FROM c WHERE (c.assignedTo = null OR c.assignedTo = '' "
-    f"OR c.assignedTo = '{user_id}' OR c.status != 'draft')"
-)
-```
-
-This prevents reassigning draft items at the database level too, but could be modified for force-assign scenarios.
-
----
-
-## 3. Existing Force/Override Logic
-
-**Finding: No existing force/override parameter exists.**
-
-The current system has no mechanism to bypass the 409 Conflict for draft items assigned to others. The workaround mentioned in the PRD is:
-> "delete the relevant assignment doc from cosmos and update the assignedTo field on the groundTruth doc"
-
----
-
-## 4. Frontend Confirmation Dialog Patterns
-
-The frontend uses **native `window.confirm()`** dialogs throughout. There are no custom modal confirmation components.
-
-### Examples Found
-
-1. **Unsaved changes warning** ([frontend/src/hooks/useGroundTruth.ts#L390](frontend/src/hooks/useGroundTruth.ts#L390)):
-   ```typescript
-   const confirmed = window.confirm(
-       "You have unsaved changes. Switching items will discard them. Continue?",
-   );
-   ```
-
-2. **Tag removal** ([frontend/src/components/app/editor/TagsEditor.tsx#L65](frontend/src/components/app/editor/TagsEditor.tsx#L65)):
-   ```typescript
-   const ok = window.confirm(`Remove tag "${tag}"?`);
-   ```
-
-3. **Turn deletion** ([frontend/src/components/app/editor/MultiTurnEditor.tsx#L161](frontend/src/components/app/editor/MultiTurnEditor.tsx#L161)):
-   ```typescript
-   if (window.confirm("Are you sure you want to delete this turn?")) {
-   ```
-
-4. **Reference removal** ([frontend/src/components/app/pages/ReferencesSection.tsx#L95](frontend/src/components/app/pages/ReferencesSection.tsx#L95)):
-   ```typescript
-   window.confirm(`Remove reference "${name}"? You can Undo for 8s.`)
-   ```
-
-5. **External link confirmation** ([frontend/src/components/modals/InspectItemModal.tsx](frontend/src/components/modals/InspectItemModal.tsx)):
-   ```typescript
-   const confirmed = confirm(
-       `You are about to visit an external website:\n\n${parsedUrl.hostname}\n\nDo you want to continue?`,
-   );
-   ```
-
-### Modal Infrastructure
-
-- [frontend/src/hooks/useModalKeys.ts](frontend/src/hooks/useModalKeys.ts) - Keyboard handling for modals (Escape to close, Enter to confirm)
-- [frontend/src/components/modals/ModalPortal.tsx](frontend/src/components/modals/ModalPortal.tsx) - Portal for rendering modals
-- [frontend/src/components/modals/InspectItemModal.tsx](frontend/src/components/modals/InspectItemModal.tsx) - Example full modal implementation
-
----
-
-## 5. Assignment Document Structure in Cosmos
-
-### Container: `assignments`
-- **Partition Key:** `/pk` (user ID with prefix `sme:{userId}`)
-
-### Document Structure
-```json
-{
-  "id": "{datasetName}|{bucket}|{itemId}",
-  "pk": "sme:{userId}",
-  "ground_truth_id": "{itemId}",
-  "datasetName": "{datasetName}",
-  "bucket": "{uuid}",
-  "docType": "sme-assignment",
-  "schemaVersion": "v1"
-}
-```
-
-### Related Operations
-
-- **Create/Update:** `repo.upsert_assignment_doc(user_id, item)`
-- **Delete:** `repo.delete_assignment_doc(user_id, dataset, bucket, ground_truth_id)`
-- **List by user:** `repo.list_assignments_by_user(user_id)`
-
----
-
-## 6. Implementation Recommendations
-
-### Backend Changes
-
-1. **Add `force` parameter to `assign_single_item`:**
-   ```python
-   async def assign_single_item(
-       self, dataset: str, bucket: UUID, item_id: str, user_id: str,
-       force: bool = False  # NEW
-   ) -> GroundTruthItem:
-   ```
-
-2. **Modify validation logic:**
-   ```python
-   if (
-       item.assignedTo
-       and item.assignedTo != user_id
-       and item.status == GroundTruthStatus.draft
-       and not force  # NEW: skip check if force=True
-   ):
-       raise ValueError("Item is already assigned to another user")
-   ```
-
-3. **Clean up old assignment document:**
-   When force-assigning, delete the previous user's `AssignmentDocument` before creating the new one.
-
-4. **Update API endpoint:**
-   Accept `force` parameter in request body:
-   ```python
-   @router.post("/{dataset}/{bucket}/{item_id}/assign", status_code=200)
-   async def assign_item(
-       dataset: str,
-       bucket: UUID,
-       item_id: str,
-       body: dict[str, Any] | None = None,  # NEW: accept { force: true }; None avoids a mutable default
-       user: UserContext = Depends(get_current_user),
-   ) -> GroundTruthItem:
-   ```
-
-### Frontend Changes
-
-1. **Catch 409 Conflict** in the assign service call
-2. **Show confirmation dialog** with current assignee info:
-   ```typescript
-   const confirmed = window.confirm(
-       `This item is currently assigned to ${currentAssignee}. ` +
-       `Do you want to take over this assignment?`
-   );
-   ```
-3. **Retry with `force: true`** if user confirms
-
-### API Contract
-
-**Request:**
-```http
-POST /v1/assignments/{dataset}/{bucket}/{item_id}/assign
-Content-Type: application/json
-
-{ "force": true }
-```
-
-**Response:** Same as current (updated `GroundTruthItem`)
-
----
-
-## 7. Related Issues
-
-- **SA-721:** "GTC: Re-think assignment limitations (unassign, vacation, etc.)"
-  - Desired behavior from PRD:
-    1. When a ground truth is already assigned to someone else, a different SME can choose "Assign to me anyway"
-    2. UI prompts for confirmation before taking over the assignment
-    3. After confirmation, assignment is transferred to the current user and the UI reflects the new assignee
-
----
-
-## Key Files Reference
-
-| Component | File |
-|-----------|------|
-| Assignment Service | [backend/app/services/assignment_service.py](backend/app/services/assignment_service.py) |
-| Assignment API Routes | [backend/app/api/v1/assignments.py](backend/app/api/v1/assignments.py) |
-| Cosmos Repository | [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py) |
-| Domain Models | [backend/app/domain/models.py](backend/app/domain/models.py) |
-| Frontend Assignment Service | [frontend/src/services/assignments.ts](frontend/src/services/assignments.ts) |
-| Design Doc | [backend/docs/assign-single-item-endpoint.md](backend/docs/assign-single-item-endpoint.md) |
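Note on section 6 of the deleted assignment-takeover-research.md above: the `force` flag and the cleanup of the previous user's assignment document can be shown together in a small in-memory sketch. Everything here is a simplified assumption. `Item` and `InMemoryRepo` are toy stand-ins, the functions are synchronous, and the repo methods take fewer arguments than the real `upsert_assignment_doc` and `delete_assignment_doc` listed in section 5.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Item:
    """Minimal stand-in for GroundTruthItem."""
    id: str
    status: str = "draft"
    assigned_to: Optional[str] = None


class InMemoryRepo:
    """Toy repo that tracks assignment docs as (user_id, item_id) pairs."""

    def __init__(self) -> None:
        self.assignment_docs = set()

    def upsert_assignment_doc(self, user_id: str, item: Item) -> None:
        self.assignment_docs.add((user_id, item.id))

    def delete_assignment_doc(self, user_id: str, item: Item) -> None:
        self.assignment_docs.discard((user_id, item.id))


def assign_single_item(repo: InMemoryRepo, item: Item, user_id: str, force: bool = False) -> Item:
    """Assign an item, optionally taking it over from another user."""
    previous = item.assigned_to
    held_by_other = bool(previous) and previous != user_id
    if held_by_other and item.status == "draft" and not force:
        raise ValueError("Item is already assigned to another user")
    if held_by_other:
        # Remove the stale doc so the item leaves the previous user's list.
        repo.delete_assignment_doc(previous, item)
    item.assigned_to = user_id
    item.status = "draft"
    repo.upsert_assignment_doc(user_id, item)
    return item
```

A real implementation would also need to relax the Cosmos filter predicate shown in section 2 when `force` is set, since that predicate rejects draft items held by another user regardless of what the service layer decides.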
diff --git a/.copilot-tracking/subagent/20260122/assignment-workflow-research.md b/.copilot-tracking/subagent/20260122/assignment-workflow-research.md
deleted file mode 100644
index 05f7aca..0000000
--- a/.copilot-tracking/subagent/20260122/assignment-workflow-research.md
+++ /dev/null
@@ -1,51 +0,0 @@
----
-topic: assignment-workflow
-jtbd: JTBD-001
-date: 2026-01-22
-status: complete
----
-
-# Research: Assignment Workflow
-
-## Context
-
-The assignment workflow enables users to request, claim, and complete curation work items with ownership and optimistic concurrency protections.
-
-## Sources Consulted
-
-### URLs
-- (None)
-
-### Codebase
-- [backend/CODEBASE.md](backend/CODEBASE.md): Documents the assignment endpoints and ETag/soft-delete conventions.
-- [frontend/README.md](frontend/README.md): Describes dev user simulation header usage.
-
-### Documentation
-- [.copilot-tracking/research/20260121-high-level-requirements-research.md](.copilot-tracking/research/20260121-high-level-requirements-research.md): Consolidates observed requirements and cites detailed sources.
-- [backend/docs/api-change-checklist-assignments.md](backend/docs/api-change-checklist-assignments.md): Captures intended stable API semantics for assignment-related write paths.
-- [backend/docs/assign-single-item-endpoint.md](backend/docs/assign-single-item-endpoint.md): Defines the single-item self-assign behavior and conflict protection.
-
-## Key Findings
-
-1. The system supports a self-serve assignment flow that returns items to work on, and a “my assignments” view scoped to the current user.
-2. Assignment write paths enforce optimistic concurrency via ETag (If-Match or equivalent) and return stable conflict semantics.
-3. Assignment ownership is enforced for mutation endpoints with a stable ownership error when violated.
-4. Status transitions that represent completing work (approve/skip/delete) clear assignment fields atomically.
-5. Doc-only gaps exist in PRD artifacts, but they are not treated as current requirements when not reflected in code.
-
-## Existing Patterns
-
-| Pattern | Location | Relevance |
-|---------|----------|-----------|
-| ETag-based optimistic concurrency | [backend/CODEBASE.md](backend/CODEBASE.md) | Defines write preconditions and conflict behavior |
-| Dev user simulation via header | [frontend/README.md](frontend/README.md) | Supports per-user assignment semantics in development |
-
-## Open Questions
-
-- (None)
-
-## Recommendations for Spec
-
-- Specify assignment lifecycle states and ownership/ETag requirements as stable contracts.
-- Specify what “my assignments” returns (draft items assigned to the current user).
-- Specify expected error behavior for ETag mismatch and ownership violations.
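Note on key finding 2 of the deleted assignment-workflow-research.md above: ETag-based optimistic concurrency can be illustrated with a toy store. This is a generic sketch of the technique, not the backend's code. `Store`, `EtagMismatch`, and the use of random hex strings as ETags are all assumptions made for illustration.

```python
import uuid


class EtagMismatch(Exception):
    """Raised when the caller's ETag no longer matches the stored one."""


class Store:
    """Toy document store that issues a new ETag on every write."""

    def __init__(self) -> None:
        self._docs = {}  # item_id -> (etag, body)

    def create(self, item_id: str, body: dict) -> str:
        etag = uuid.uuid4().hex
        self._docs[item_id] = (etag, dict(body))
        return etag

    def get(self, item_id: str) -> dict:
        return dict(self._docs[item_id][1])

    def update(self, item_id: str, changes: dict, if_match: str) -> str:
        current_etag, body = self._docs[item_id]
        if if_match != current_etag:
            # Someone else wrote since the caller last read the item.
            raise EtagMismatch(item_id)
        new_etag = uuid.uuid4().hex
        self._docs[item_id] = (new_etag, {**body, **changes})
        return new_etag
```

A client that holds a stale ETag gets `EtagMismatch` instead of silently overwriting a newer edit, and it can then re-fetch the item and retry. The spec recommended above would still need to state which HTTP status the API maps this condition to.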
diff --git a/.copilot-tracking/subagent/20260122/batch-validation-research.md b/.copilot-tracking/subagent/20260122/batch-validation-research.md
deleted file mode 100644
index 3b50651..0000000
--- a/.copilot-tracking/subagent/20260122/batch-validation-research.md
+++ /dev/null
@@ -1,241 +0,0 @@
-# Batch Validation Research
-
-**Date:** 2026-01-22
-**Story:** SA-241 - Enhanced error information for batch import
-**Status:** Complete
-
-## Research Questions Answered
-
-### 1. How does the current bulk import validate individual records?
-
-**Location:** [validation_service.py](../../../backend/app/services/validation_service.py)
-
-The validation flow has two stages:
-
-#### Stage 1: Pre-persistence validation (validation_service.py)
-
-```python
-async def validate_bulk_items(items: list[GroundTruthItem]) -> dict[str, list[str]]:
-```
-
-- **Tag validation only**: Currently validates only `manualTags` against the tag registry
-- **Concurrent validation**: Uses `asyncio.gather()` to validate all items concurrently
-- **Caching**: Fetches tag registry once and passes to all validation calls
-- **Error collection**: Returns `dict[item_id, list[errors]]` mapping
-
-**Current validation checks:**
-
-| Check | Field | Implementation |
-|-------|-------|----------------|
-| Tag existence | `manualTags` | Tags must exist in tag registry |
-| Tag format | `manualTags` | Must match `group:value` pattern |
-| Tag rules | `manualTags` | TAG_SCHEMA rules (e.g., uniqueness within group) |
-
-#### Stage 2: Persistence-time validation (cosmos_repo.py)
-
-```python
-async def import_bulk_gt(self, items: list[GroundTruthItem], buckets: int | None = None) -> BulkImportResult:
-```
-
-- **409 Conflict**: Catches duplicate ID errors from Cosmos
-- **Other Cosmos errors**: Generic error message with article URL and ID
-
-### 2. What error information is returned when records fail validation?
-
-**Response model:** `ImportBulkResponse` in [ground_truths.py#L30](../../../backend/app/api/v1/ground_truths.py#L30)
-
-```python
-class ImportBulkResponse(BaseModel):
-    imported: int       # Number of items successfully imported
-    errors: list[str]   # List of error messages
-    uuids: list[str]    # IDs in request order (includes failed items)
-```
-
-**Current error message formats:**
-
-| Source | Format | Example |
-|--------|--------|---------|
-| Tag validation | `"Item '{item_id}': Error {message}"` | `"Item 'test-2': Error Unknown tag 'invalid:tag'."` |
-| Duplicate (409) | `"exists (article: {url}, id: {id})"` | `"exists (article: http://..., id: abc-123)"` |
-| Cosmos error | `"create_failed (article: {url}, id: {id}): {message}"` | `"create_failed (article: unknown, id: xyz): RU exceeded"` |
-
-**Gaps identified:**
-
-1. No structured error format - errors are strings, not objects
-2. No field-level error information
-3. No row/index reference for correlation
-4. No error code for programmatic handling
-5. Pydantic validation errors surface as an HTTP 422 response for the whole request rather than as entries in the `errors` array
-
-### 3. Is Cosmos batch/transactional batch being used, or individual creates?
-
-**Answer: Individual creates (1-by-1)**
-
-**Location:** [cosmos_repo.py#L486](../../../backend/app/adapters/repos/cosmos_repo.py#L486)
-
-```python
-# sequential create to keep simple and clear errors
-for it in items:
-    doc = self._to_doc(it)
-    try:
-        await gt.create_item(doc)  # Individual create
-        success += 1
-    except CosmosHttpResponseError as e:
-        # Error handling...
-```
-
-**Current behavior:**
-
-- Items are created **sequentially** in a loop
-- No transactional batch support
-- Partial success is possible (some items succeed, some fail)
-- No rollback capability
-
-**Cosmos SDK batch capabilities NOT used:**
-
-- `container.execute_item_batch()` (Python SDK transactional batch) - not used
-- `TransactionalBatch` - not used
-- Bulk executor - not used
-
-### 4. What's the current ImportBulkResponse structure?
-
-**Location:** [models.py#L182](../../../backend/app/domain/models.py#L182)
-
-```python
-class BulkImportResult(BaseModel):  # Internal model
-    imported: int = 0
-    errors: list[str] = Field(default_factory=list)
-
-class ImportBulkResponse(BaseModel):  # API response
-    imported: int       # Number of items successfully imported
-    errors: list[str]   # List of error messages for failed items
-    uuids: list[str]    # IDs in same order as request
-```
-
-**Example successful response:**
-
-```json
-{
-  "imported": 2,
-  "errors": [],
-  "uuids": ["item-1", "item-2"]
-}
-```
-
-**Example partial failure response:**
-
-```json
-{
-  "imported": 1,
-  "errors": ["Item 'item-2': Error Unknown tag 'bad:tag'."],
-  "uuids": ["item-1", "item-2"]
-}
-```
-
-## Current Error Handling Flow
-
-```
-┌─────────────────────────────────────────────────────────────────┐
-│                       import_bulk()                              │
-├─────────────────────────────────────────────────────────────────┤
-│ 1. Generate IDs for items missing them (randomname)              │
-│ 2. validate_bulk_items() ─► Tag validation                       │
-│    ├─ Fetch tag registry once                                    │
-│    ├─ Validate each item's manualTags                            │
-│    └─ Return dict[item_id, errors]                               │
-│ 3. Filter out invalid items                                      │
-│ 4. Apply computed tags to valid items                            │
-│ 5. container.repo.import_bulk_gt() ─► Cosmos persistence         │
-│    ├─ Loop: create_item() for each                               │
-│    ├─ Catch 409: append "exists" error                           │
-│    └─ Catch other: append "create_failed" error                  │
-│ 6. Merge validation errors + persistence errors                  │
-│ 7. Return ImportBulkResponse                                     │
-└─────────────────────────────────────────────────────────────────┘
-```
-
-## Identified Gaps for SA-241
-
-### Gap 1: Unstructured error messages
-
-**Current:** Plain strings  
-**Needed:** Structured error objects with:
-
-- `index`: Row number in original request
-- `itemId`: The item's ID
-- `field`: Which field failed (if applicable)
-- `code`: Error code for programmatic handling
-- `message`: Human-readable message
-
-### Gap 2: No batch processing
-
-**Current:** Sequential `create_item()` calls  
-**Needed:** Cosmos transactional batch for:
-
-- Better performance (one network round-trip per batch of up to 100 operations)
-- Atomic operations within a single logical partition
-- RU efficiency
-
-### Gap 3: Limited validation scope
-
-**Current:** Only manualTags validated  
-**Needed:** Consider validating:
-
-- Required fields
-- Field length limits
-- Reference URL format
-- Custom business rules
-
-### Gap 4: No partial rollback capability
-
-**Current:** Items persist as they succeed  
-**Needed:** Consider all-or-nothing mode option
-
-### Gap 5: No validation summary
-
-**Current:** Just error list  
-**Needed:** Summary stats:
-
-- Total items received
-- Validation failures count
-- Persistence failures count
-- By-field error breakdown
-
-## Recommendations for SA-241
-
-1. **Define structured error model:**
-
-   ```python
-   class BulkImportError(BaseModel):  # named to avoid shadowing the built-in ImportError
-       index: int
-       itemId: str | None
-       field: str | None
-       code: str  # e.g., "INVALID_TAG", "DUPLICATE_ID"
-       message: str
-   ```
-
-2. **Enhance ImportBulkResponse:**
-
-   ```python
-   class ImportBulkResponse(BaseModel):
-       imported: int
-       failed: int
-       total: int
-       errors: list[BulkImportError]  # Structured errors
-       uuids: list[str]
-   ```
-
-3. **Consider Cosmos batch operations** for performance (separate story)
-
-4. **Add validation for additional fields** as needed
-
-## Files Analyzed
-
-| File | Purpose |
-|------|---------|
-| [ground_truths.py](../../../backend/app/api/v1/ground_truths.py) | API endpoint, response models |
-| [validation_service.py](../../../backend/app/services/validation_service.py) | Pre-persistence validation |
-| [cosmos_repo.py](../../../backend/app/adapters/repos/cosmos_repo.py) | Database operations |
-| [models.py](../../../backend/app/domain/models.py) | Domain models |
-| [tagging_service.py](../../../backend/app/services/tagging_service.py) | Tag validation logic |
-| [test_bulk_import_tag_validation.py](../../../backend/tests/unit/test_bulk_import_tag_validation.py) | Test coverage |
diff --git a/.copilot-tracking/subagent/20260122/ci-code-quality-research.md b/.copilot-tracking/subagent/20260122/ci-code-quality-research.md
deleted file mode 100644
index d404446..0000000
--- a/.copilot-tracking/subagent/20260122/ci-code-quality-research.md
+++ /dev/null
@@ -1,219 +0,0 @@
-# CI Code Quality Research
-
-**Date:** 2026-01-22
-**Topic:** ci-code-quality
-**Jira:** SA-745 - Enforce formatting and linters in CI, reconcile drift
-
----
-
-## Summary
-
-The repository has established linting and formatting tooling for both the backend (Python) and the frontend (TypeScript). Backend pre-commit hooks are configured, but the **frontend has no pre-commit or pre-push hooks**, and there is **active drift in the frontend** that needs reconciliation before CI enforcement.
-
----
-
-## 1. Backend (Python) Configuration
-
-### Package Manager
-
-- **uv** - Modern Python package manager from Astral
-- Lock file: [backend/uv.lock](backend/uv.lock)
-
-### Linting/Formatting Tools
-
-| Tool | Purpose | Configuration |
-|------|---------|---------------|
-| **Ruff** | Linting + formatting | [backend/pyproject.toml](backend/pyproject.toml) `[tool.ruff]` |
-| **Black** | Formatting (legacy, likely superseded by ruff) | `[tool.black]` section |
-| **ty** | Type checking | `[tool.ty]` section |
-| **Vulture** | Dead code detection | `[tool.vulture]` section |
-
-### Ruff Configuration
-
-```toml
-[tool.ruff]
-line-length = 100
-
-[tool.ruff.lint]
-select = [
-    "F",       # Pyflakes (F401, F841, F811, etc.)
-    "ERA",     # Commented code (ERA001)
-    "RUF059",  # Unused unpacked variables
-]
-```
-
-### Pre-commit Hooks (Backend Only)
-
-File: [backend/.pre-commit-config.yaml](backend/.pre-commit-config.yaml)
-
-| Hook | Stage | Scope |
-|------|-------|-------|
-| `ruff-format` | pre-commit | `^backend/.*\.py$` |
-| `ruff` (lint + fix) | pre-commit | `^backend/.*\.py$` |
-| `ty` | pre-commit | `^backend/app/.*\.py$` |
-| `pytest` | **pre-push** | Backend tests |
-
-### Current Drift Status
-
-```
-✅ Backend lint: All checks passed!
-✅ Backend format: 67 files already formatted
-```
-
-**No drift in backend.**
-
----
-
-## 2. Frontend (TypeScript) Configuration
-
-### Package Manager
-
-- **npm** - Standard Node.js package manager
-- Lock file: [frontend/package-lock.json](frontend/package-lock.json)
-
-### Linting/Formatting Tools
-
-| Tool | Purpose | Configuration |
-|------|---------|---------------|
-| **Biome** | Linting + formatting | [frontend/biome.json](frontend/biome.json) |
-| **TypeScript** | Type checking | `tsc -b` via npm script |
-| **Knip** | Dead code detection | [frontend/knip.json](frontend/knip.json) |
-
-### Biome Configuration
-
-```json
-{
-    "formatter": { "enabled": true },
-    "linter": {
-        "enabled": true,
-        "rules": {
-            "correctness": {
-                "noUnusedImports": "error",
-                "noUnusedVariables": "warn",
-                "noUnusedFunctionParameters": "warn",
-                "noUnusedPrivateClassMembers": "warn"
-            }
-        }
-    }
-}
-```
-
-### NPM Scripts
-
-```json
-{
-    "lint": "biome check --write",
-    "typecheck": "tsc -b --pretty false"
-}
-```
-
-### Pre-commit/Pre-push Hooks
-
-**None configured.** No husky, lefthook, or lint-staged packages present.
-
-### Current Drift Status
-
-```
-❌ Frontend: Found 31 errors (formatting + organize imports)
-```
-
-**Active drift detected:**
-
-- 2 config files need formatting (`biome.json`, `knip.json`)
-- Multiple source files have import organization issues
-- Formatting issues in `vitest.config.ts` and source files
-
----
-
-## 3. CI Workflow Analysis
-
-File: [.github/workflows/gtc-ci.yml](.github/workflows/gtc-ci.yml)
-
-### Current CI Checks
-
-| Check | Type | Status |
-|-------|------|--------|
-| Backend unit tests | pytest | ✅ Runs |
-| Backend integration tests | pytest | ✅ Runs |
-| `ty check app` | Type checking | ✅ Runs |
-| OpenAPI spec freshness | git diff | ✅ Runs |
-| Frontend types check | `api:types:check` | ✅ Runs |
-| Frontend tests | vitest | ✅ Runs |
-| **Backend lint/format** | ruff | ❌ **Not in CI** |
-| **Frontend lint/format** | biome | ❌ **Not in CI** |
-
-### Missing CI Jobs
-
-1. **Backend linting:** `uv run ruff check app`
-2. **Backend formatting:** `uv run ruff format app --check`
-3. **Frontend linting:** `npx biome check`
-
----
-
-## 4. Recommendations for SA-745
-
-### Phase 1: Reconcile Drift
-
-1. Run `npm run lint` in frontend to auto-fix 31 errors
-2. Commit formatting fixes separately for clean history
-
-### Phase 2: Add CI Enforcement
-
-Add to `.github/workflows/gtc-ci.yml`:
-
-```yaml
-- name: Backend lint
-  working-directory: backend
-  run: uv run ruff check app
-
-- name: Backend format check
-  working-directory: backend
-  run: uv run ruff format app --check
-
-- name: Frontend lint
-  working-directory: frontend
-  run: npx biome check
-```
-
-### Phase 3: Add Frontend Pre-push Hooks
-
-Options:
-
-1. **Husky** - Most popular, npm-based
-2. **Lefthook** - Fast, language-agnostic
-3. **Extend pre-commit** - Add frontend hooks to existing backend config
-
-Recommended: Extend existing `pre-commit` framework (already in dev dependencies) with frontend hooks.
-
-### Phase 4: Environment Alignment
-
-- Document required tool versions in README
-- Consider adding `engines` field to `package.json`
-- Ensure `pre-commit install` is documented in setup instructions
-
----
-
-## 5. File References
-
-| File | Purpose |
-|------|---------|
-| [backend/pyproject.toml](backend/pyproject.toml) | Python tools config |
-| [backend/.pre-commit-config.yaml](backend/.pre-commit-config.yaml) | Pre-commit hooks |
-| [frontend/biome.json](frontend/biome.json) | Biome linter/formatter config |
-| [frontend/package.json](frontend/package.json) | NPM scripts and dependencies |
-| [.github/workflows/gtc-ci.yml](.github/workflows/gtc-ci.yml) | CI workflow |
-
----
-
-## 6. Quick Fix Commands
-
-```bash
-# Fix frontend drift
-cd frontend && npm run lint
-
-# Verify backend is clean
-cd backend && uv run ruff check app && uv run ruff format app --check
-
-# Install pre-commit hooks (backend)
-cd backend && uv run pre-commit install
-```
diff --git a/.copilot-tracking/subagent/20260122/code-conventions-research.md b/.copilot-tracking/subagent/20260122/code-conventions-research.md
deleted file mode 100644
index 5996a64..0000000
--- a/.copilot-tracking/subagent/20260122/code-conventions-research.md
+++ /dev/null
@@ -1,250 +0,0 @@
-# Code Conventions Research
-
-**Research Date:** 2026-01-22  
-**Related Jira Stories:** SA-249, SA-250, SA-245
-
----
-
-## Executive Summary
-
-This research identifies patterns requiring standardization across three areas:
-1. **Pydantic models vs JSON dump** - Limited issues; most API endpoints correctly return Pydantic models
-2. **Exception handling** - Significant use of generic `Exception` catches that could use specific Cosmos error types
-3. **Logging patterns** - Two `print()` statements in app code; mature logging infrastructure using `extra={}` pattern
-
----
-
-## 1. JSON Dumps vs Pydantic Models (SA-249)
-
-### Findings
-
-The codebase generally handles Pydantic models correctly. FastAPI endpoints return Pydantic models directly, letting FastAPI handle JSON serialization.
-
-#### Locations Using `json.dumps()` or `model_dump()`
-
-| File | Line | Context | Assessment |
-|------|------|---------|------------|
-| [cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L409) | 409 | `model_dump(mode="json", by_alias=True)` | **Appropriate** - Preparing data for Cosmos DB storage |
-| [cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L1068) | 1068 | `model_dump(mode="json", by_alias=True)` | **Appropriate** - Document upsert |
-| [cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L1917) | 1917 | `model_dump(mode="json", by_alias=True)` | **Appropriate** - Assignment document upsert |
-| [cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L380) | 380 | `json.loads(json.dumps(sanitized, ensure_ascii=True))` | **Appropriate** - Unicode sanitization workaround |
-| [snapshot_service.py](backend/app/services/snapshot_service.py#L81) | 81 | `model_dump(mode="json", ...)` for export items | **Appropriate** - Export formatting |
-| [inference.py](backend/app/adapters/inference/inference.py#L772) | 772 | `json.dumps({"error": str(e)})` | **Review** - Error response in retrieval tool |
-
-#### Bucket UUID to String Coercion
-
-| File | Line | Context |
-|------|------|---------|
-| [cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L411) | 411 | `d["bucket"] = str(d["bucket"])` - Converting for Cosmos storage |
-| [cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L1058) | 1058 | `str(bucket)` - Partition key construction |
-| [cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L1272) | 1272 | `str(it.bucket)` - Partition key for delete |
-| [cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L1762) | 1762 | `str(bucket)` - Partition key construction |
-| [cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L1837) | 1837 | `str(bucket)` - Partition key construction |
-| [cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L1907) | 1907 | `str(gt.bucket)` - Document ID construction |
-| [cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L1975) | 1975 | `str(bucket)` - Item ID construction |
-
-**Assessment:** Bucket-to-string conversion happens at the repository layer for Cosmos DB compatibility. This is appropriate because Cosmos DB stores JSON, which has no native UUID type, so the value must be serialized as a string before it is used in a document or partition key. The Pydantic models properly type `bucket` as `UUID`, and conversion only happens at persistence boundaries.
-
-### Recommendation
-
-- **No changes required** for JSON serialization patterns in API layer
-- Repository-level `model_dump()` and `str(bucket)` conversions are appropriate for persistence
-- Consider documenting the pattern: "Models remain typed; string conversion only at persistence boundary"
-
----
-
-## 2. Generic Exception Catches (SA-250)
-
-### Locations in App Code
-
-The codebase has extensive use of generic `Exception` catches. Most are intentional defensive patterns with pragmatic comments, but some could benefit from more specific error types.
-
-#### High-Priority (Cosmos-related operations)
-
-| File | Line | Context | Recommendation |
-|------|------|---------|----------------|
-| [cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L113) | 113 | Generic exception in repo | Use `CosmosHttpResponseError` |
-| [container.py](backend/app/container.py#L83) | 83 | Credential building | Keep generic (import failures) |
-| [container.py](backend/app/container.py#L127) | 127 | Search init | Keep generic (optional feature) |
-| [container.py](backend/app/container.py#L260) | 260 | Inference init | Keep generic (optional feature) |
-
-#### API Layer Exception Catches
-
-| File | Line | Context | Recommendation |
-|------|------|---------|----------------|
-| [search.py](backend/app/api/v1/search.py#L22) | 22 | Search endpoint | Add specific error handling |
-| [ground_truths.py](backend/app/api/v1/ground_truths.py#L308) | 308 | Status parsing | Keep generic (data validation) |
-| [ground_truths.py](backend/app/api/v1/ground_truths.py#L483) | 483 | Tag recompute | Add specific error types |
-| [assignments.py](backend/app/api/v1/assignments.py#L149) | 149 | Assignment update | Use `CosmosHttpResponseError` |
-| [assignments.py](backend/app/api/v1/assignments.py#L238) | 238 | Assignment update | Use `CosmosHttpResponseError` |
-| [chat.py](backend/app/api/v1/chat.py#L133) | 133 | Chat endpoint | Keep generic (commented as safeguard) |
-| [chat.py](backend/app/api/v1/chat.py#L151) | 151 | Chat endpoint | Keep generic (multi-service) |
-| [tags.py](backend/app/api/v1/tags.py#L83) | 83 | Tags endpoint | Add specific error types |
-
-#### Startup/Lifecycle (main.py)
-
-The [main.py](backend/app/main.py) file has numerous generic `Exception` catches (lines 77, 80, 114, 130, 155, 162, 170, 175, 197, 229). These are intentional "never block startup" patterns and should remain generic.
-
-#### Codebase Already Using CosmosHttpResponseError
-
-The codebase demonstrates proper usage in several places:
-
-```python
-# cosmos_repo.py
-from azure.cosmos.exceptions import CosmosHttpResponseError, CosmosResourceNotFoundError
-```
-
-Used correctly in:
-- [cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L488) - Line 488
-- [cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L1060) - Line 1060
-- [cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L1091) - Line 1091
-
-### Recommendation
-
-1. **Keep generic** exceptions in:
-   - Startup/lifecycle code (main.py)
-   - Optional feature initialization (container.py)
-   - Third-party library error handling
-
-2. **Replace with specific** exceptions in:
-   - Repository operations interacting with Cosmos
-   - API endpoints that call Cosmos operations
-   - Use `CosmosHttpResponseError` and `CosmosResourceNotFoundError`
-
----
-
-## 3. Print Statements (SA-245)
-
-### Locations in App Code
-
-Only **2 print statements** exist in the main app code:
-
-| File | Line | Code | Recommendation |
-|------|------|------|----------------|
-| [main.py](backend/app/main.py#L122) | 122 | `print(APP_VERSION)` | Replace with `logger.info("app.version", extra={"version": APP_VERSION})` |
-| [cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L401) | 401 | `print(item.__repr__())` | Replace with `logger.error("repo.invalid_item", extra={"item": repr(item)})` |
-
-### Scripts with Print Statements (Lower Priority)
-
-Scripts in `backend/scripts/` use `print()` extensively for CLI output:
-- `cosmos_container_manager.py` - CLI progress output
-- `cosmos_export_import.py` - Migration logging
-- `delete_cosmos_emulator_dbs.py` - Cleanup status
-- `init_seed_data.py` - Seed data feedback
-
-**Assessment:** Script print statements are appropriate for CLI tools and don't need conversion.
-
----
-
-## 4. Logging Patterns Analysis
-
-### Current Architecture
-
-The codebase has a mature logging infrastructure in [app/core/logging.py](backend/app/core/logging.py):
-
-#### Key Components
-
-1. **Setup Function** (`setup_logging`):
-   - Configures root logger with structured format
-   - Suppresses noisy Azure SDK logs
-   - Format: `%(asctime)s %(levelname)s %(name)s user=%(user_id)s %(message)s`
-
-2. **Trace Context Filter** (`_TraceContextFilter`):
-   - Injects `trace_id`, `span_id`, `user_id` into every log record
-   - Integrates with OpenTelemetry when available
-
-3. **User Identity Context**:
-   - `ContextVar` for current user ID
-   - `set_current_user()` / `clear_current_user()` functions
-   - Middleware automatically populates from Easy Auth or headers
-
-4. **Log Record Factory** (`_install_log_record_factory`):
-   - Custom factory ensures `user_id` attribute always exists
-   - Prevents `KeyError` when using `extra={"user_id": ...}`
-
-### The "Extra Field" Pattern (SA-245)
-
-The `extra={}` parameter is used throughout for structured logging:
-
-```python
-# Example from assignment_service.py
-logger.info(
-    "self_assign.assigned",
-    extra=self._log_context(it.id, it.datasetName),
-)
-
-# Helper method creates consistent context
-def _log_context(self, item_id: str | None = None, dataset: str | None = None) -> dict[str, str]:
-    context: dict[str, str] = {}
-    if item_id:
-        context["item_id"] = item_id
-    if dataset:
-        context["dataset"] = dataset
-    return context
-```
-
-#### Locations Using `extra={}` Pattern
-
-| File | Count | Notes |
-|------|-------|-------|
-| [assignment_service.py](backend/app/services/assignment_service.py) | 14 | Consistent `_log_context()` helper |
-| [cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py) | 8+ | Various repo operations |
-| [search_service.py](backend/app/services/search_service.py) | 1 | Search results |
-| [tagging_service.py](backend/app/services/tagging_service.py) | 1 | Tag collision warning |
-| [validation_service.py](backend/app/services/validation_service.py) | 2 | Validation logging |
-
-### Current Issues with Extra Pattern
-
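-1. **Reserved field collision**: The `user_id` field is reserved by the log record factory. Passing `extra={"user_id": ...}` collides with the attribute the factory already sets; the standard library's `Logger.makeRecord` raises `KeyError` when an `extra` key would overwrite an existing record attribute (documented in [assignment_service.py](backend/app/services/assignment_service.py#L27-L34)).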
-
-2. **Inconsistent key naming**: Some use `item_id`, others use `itemId`; some use `count`, others use `candidate_count`.
-
-3. **Missing context helpers**: Only `AssignmentService` has a `_log_context()` helper; other services construct extra dicts inline.
-
-### Recommendations
-
-1. **Standardize key names** across all services (snake_case recommended)
-2. **Create shared logging context helper** in `app/core/logging.py`
-3. **Document reserved keys** (`user_id`, `trace_id`, `span_id`)
-4. **Consider structured logging library** (e.g., `structlog`) for better JSON output in production
-
----
-
-## 5. Summary of Required Changes
-
-### Immediate (Low Effort)
-
-| Priority | File | Change |
-|----------|------|--------|
-| High | [main.py#L122](backend/app/main.py#L122) | Replace `print(APP_VERSION)` with logger |
-| High | [cosmos_repo.py#L401](backend/app/adapters/repos/cosmos_repo.py#L401) | Replace `print(item.__repr__())` with logger |
-
-### Medium-Term (Moderate Effort)
-
-| Priority | Scope | Change |
-|----------|-------|--------|
-| Medium | API endpoints | Replace generic `Exception` with `CosmosHttpResponseError` where appropriate |
-| Medium | Logging | Standardize extra field key naming convention |
-| Low | Logging | Create shared `_log_context()` helper |
-
-### No Changes Required
-
-- Pydantic model return patterns (already correct)
-- Bucket UUID-to-string conversion (appropriate at persistence layer)
-- Generic exceptions in startup/lifecycle code
-- Print statements in CLI scripts
-
----
-
-## Appendix: Files Referenced
-
-- [backend/app/main.py](backend/app/main.py)
-- [backend/app/core/logging.py](backend/app/core/logging.py)
-- [backend/app/container.py](backend/app/container.py)
-- [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py)
-- [backend/app/services/assignment_service.py](backend/app/services/assignment_service.py)
-- [backend/app/api/v1/ground_truths.py](backend/app/api/v1/ground_truths.py)
-- [backend/app/api/v1/assignments.py](backend/app/api/v1/assignments.py)
-- [backend/app/api/v1/search.py](backend/app/api/v1/search.py)
-- [backend/app/api/v1/chat.py](backend/app/api/v1/chat.py)
-- [backend/app/api/v1/tags.py](backend/app/api/v1/tags.py)
diff --git a/.copilot-tracking/subagent/20260122/concurrency-control-research.md b/.copilot-tracking/subagent/20260122/concurrency-control-research.md
deleted file mode 100644
index a742fa6..0000000
--- a/.copilot-tracking/subagent/20260122/concurrency-control-research.md
+++ /dev/null
@@ -1,212 +0,0 @@
----
-topic: concurrency-control
-jtbd: JTBD-008
-date: 2026-01-22
-status: complete
----
-
-# Research: Concurrency Control
-
-## Context
-
-The concurrency control mechanism prevents race conditions during simultaneous updates. This research examines how GTC handles concurrent modifications to ground-truth items and assignments, identifies potential race conditions, and documents Azure Cosmos DB's concurrency mechanisms.
-
-## Sources Consulted
-
-### Codebase
-
-- [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py): Main Cosmos DB repository implementation with ETag-based optimistic concurrency
-- [backend/app/services/assignment_service.py](backend/app/services/assignment_service.py): Assignment service with self-assign workflow
-- [backend/app/api/v1/assignments.py](backend/app/api/v1/assignments.py): Assignment API endpoints with ETag enforcement
-- [backend/app/api/v1/ground_truths.py](backend/app/api/v1/ground_truths.py): Ground-truth API endpoints with ETag enforcement
-- [backend/docs/user-self-serve-plan.md](backend/docs/user-self-serve-plan.md): Design document for concurrent assignment handling
-
-### Documentation
-
-- [Azure Cosmos DB: Transactions and Optimistic Concurrency Control](https://learn.microsoft.com/en-us/azure/cosmos-db/database-transactions-optimistic-concurrency): Official Microsoft documentation on ETag-based OCC
-- [specs/assignment-workflow.md](specs/assignment-workflow.md): Spec documenting concurrency requirements (NFR-001)
-- [specs/data-persistence.md](specs/data-persistence.md): Spec documenting ETag enforcement requirement (FR-005)
-- [backend/CODEBASE.md](backend/CODEBASE.md): Documents ETag concurrency conventions
-
-### PR Review Comments
-
-- PR #21 review comments (URLs returned 404 - repository may be private or comments deleted)
-
-## Key Findings
-
-### 1. ETag-Based Optimistic Concurrency Is Implemented
-
-GTC uses Azure Cosmos DB's native `_etag` system property for optimistic concurrency control:
-
-- **All write paths require ETag**: Both `assignments.py` and `ground_truths.py` enforce ETag via `If-Match` header or `etag` body field
-- **HTTP 412 on mismatch**: Returns "ETag mismatch" when server ETag differs from client-provided ETag
-- **Conditional replace**: Uses `MatchConditions.IfNotModified` with `replace_item()` ([cosmos_repo.py#L854-L870](backend/app/adapters/repos/cosmos_repo.py#L854-L870))
-
-### 2. Assignment Uses Patch with Filter Predicate (Production)
-
-For production Cosmos DB, assignments use atomic patch operations with `filter_predicate`:
-
-```python
-# From cosmos_repo.py _assign_to_with_patch()
-filter_predicate = (
-    f"FROM c WHERE (c.assignedTo = null OR c.assignedTo = '' "
-    f"OR c.assignedTo = '{user_id}' OR c.status != 'draft')"
-)
-```
-
-This atomically enforces that items can only be assigned if:
-- Item is unassigned (`assignedTo = null` or empty)
-- Item is already assigned to requesting user
-- Item is not in draft state (allowing re-assignment of completed items)
-
-### 3. Emulator Uses Read-Modify-Replace Pattern
-
-For Cosmos DB emulator (which doesn't support `filter_predicate`), GTC falls back to a read-modify-replace pattern ([cosmos_repo.py#L1322-L1378](backend/app/adapters/repos/cosmos_repo.py#L1322-L1378)):
-
-```python
-# Conditional check happens in application code
-can_assign = (
-    not current_assigned_to
-    or current_assigned_to == ""
-    or current_assigned_to == user_id
-    or current_status != GroundTruthStatus.draft.value
-)
-```
-
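-**Risk**: The emulator path has a time-of-check to time-of-use (TOCTOU) window between read and replace, so two concurrent callers can both pass the check before either writes.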
-
-### 4. Assignment Document Cleanup Is Non-Atomic
-
-When assignments complete (approve/skip/delete), the workflow:
-1. Updates GroundTruthItem (clears `assignedTo`)
-2. Deletes AssignmentDocument (separate operation)
-
-If step 2 fails, orphaned assignment docs may exist. The code handles this gracefully by logging errors but not failing the request ([assignments.py#L220-L240](backend/app/api/v1/assignments.py#L220-L240)).
-
-### 5. Self-Assign Handles Contention via Retry
-
-The self-assign workflow ([assignment_service.py#L36-L101](backend/app/services/assignment_service.py#L36-L101)):
-- Samples candidates with 2x overfetch to handle contention
-- Retries once with exclusion list if initial pass is short
-- Individual assignment failures don't stop the batch
-
-## Race Condition Risks
-
-| Operation | Risk | Current Mitigation | Recommended Fix |
-|-----------|------|-------------------|-----------------|
-| Ground-truth update | Lost update if two users modify same item | ETag required on all writes; 412 on mismatch | **Adequate** - correctly implemented |
-| Self-serve assignment | Two users claim same item | Patch with `filter_predicate` (atomic) | **Adequate for production**; emulator has TOCTOU |
-| Single-item assign | Two users click assign simultaneously | Validates `assignedTo` before `assign_to()` | **Adequate** - `assign_to()` is atomic in production |
-| Status transition | Concurrent approve/skip/delete | ETag enforced; separate users blocked by ownership | **Adequate** - ownership + ETag |
-| Assignment doc cleanup | Orphaned docs if delete fails | Best-effort delete; logs error | **Low risk** - docs cleaned on next user query |
-| Curation instructions | Two users update dataset instructions | ETag-based conditional replace | **Adequate** |
-| Emulator assignment | TOCTOU between read and replace | None (emulator limitation) | Accept risk or use stored procedure |
-
-## Assignment Workflow Analysis
-
-### Current Flow
-
-```
-┌──────────────────────────────────────────────────────────────────┐
-│                     Self-Serve Assignment                         │
-├──────────────────────────────────────────────────────────────────┤
-│ 1. sample_unassigned(limit * 2)                                  │
-│    └─> Query for draft/skipped items where assignedTo is null    │
-│                                                                  │
-│ 2. For each candidate:                                           │
-│    ├─ assign_to(item_id, user_id)                                │
-│    │   └─> Patch with filter_predicate (atomic)                  │
-│    │       - Success: returns True                                │
-│    │       - 412/conflict: returns False                          │
-│    │                                                              │
-│    └─ If success: upsert_assignment_doc()                        │
-│        └─> Creates materialized view doc in assignments container │
-│                                                                  │
-│ 3. Retry once with exclude_ids if still below limit              │
-└──────────────────────────────────────────────────────────────────┘
-```
-
-### Race Scenarios
-
-**Scenario A: Two users request assignments simultaneously**
-- Both query `sample_unassigned()` and get overlapping candidate sets
-- Each calls `assign_to()` with `filter_predicate`
-- Cosmos DB ensures only one succeeds per item
-- Losing user's request returns `False`, moves to next candidate
-- **Result**: Safe - atomic at database level
-
-**Scenario B: User A assigns while User B updates same item**
-- User A holds item with ETag `E1`
-- User B assigns item (changes `assignedTo`)
-- User A submits update with `E1`
-- Cosmos rejects with 412 (ETag `E2` now on server)
-- **Result**: Safe - ETag prevents lost update
-
-**Scenario C: Two tabs approve same item**
-- Tab 1 and Tab 2 both load item with ETag `E1`
-- Tab 1 approves -> succeeds, ETag becomes `E2`
-- Tab 2 approves with `E1` -> 412 Precondition Failed
-- **Result**: Safe - user sees conflict error
-
-**Scenario D (Emulator only): Assignment TOCTOU**
-- User A reads item (unassigned)
-- User B reads item (unassigned)
-- User A writes `assignedTo=A` (succeeds)
-- User B writes `assignedTo=B` (succeeds - no ETag check)
-- **Result**: User A's assignment lost
-- **Mitigation**: Emulator is development-only; production uses atomic patch
-
-## Azure Cosmos DB Concurrency Mechanisms
-
-From official documentation:
-
-### 1. Optimistic Concurrency Control (OCC)
-- Every item has system-generated `_etag` property
-- Updated automatically on every write
-- Use `If-Match` header with `_etag` value for conditional writes
-- Server returns 412 Precondition Failed on mismatch
-
-### 2. Patch Operations with Filter Predicate
-- Atomic conditional update in single round-trip
-- Filter evaluated server-side before applying patch
-- Returns 412 if filter doesn't match
-
-### 3. Stored Procedures
-- ACID transactions within a logical partition
-- Automatic rollback on exception
-- Useful for multi-item atomic operations
-
-### 4. Status Code Summary
-
-| Code | Meaning | Retry? |
-|------|---------|--------|
-| 409 | Conflict (duplicate ID or unique constraint) | No |
-| 412 | Precondition Failed (ETag mismatch) | Read-then-retry |
-| 449 | Transient write conflict | Yes with backoff |
-
-## Recommendations for Spec
-
-### Must Include
-
-1. **ETag enforcement on all writes**: Document that all update/delete operations require valid ETag; missing or mismatched ETag returns HTTP 412
-2. **Assignment atomicity**: Document that production uses Cosmos DB patch with `filter_predicate` for atomic assignment
-3. **Ownership enforcement**: Document that only the assigned user can modify items in draft state
-4. **Error handling contract**: Define stable error codes for 412 (ETag mismatch) and 409 (assignment conflict)
-
-### Should Include
-
-5. **Emulator limitations**: Note that emulator path has reduced concurrency guarantees (acceptable for development)
-6. **Assignment document consistency**: Document that assignment docs are best-effort and may be orphaned temporarily
-7. **Self-assign retry behavior**: Document overfetch and retry strategy for contention handling
-
-### Nice to Have
-
-8. **Monitoring guidance**: Recommend logging 412/409 rates to detect contention hotspots
-9. **Client retry guidance**: Recommend exponential backoff on 412 with fresh read before retry
-10. **Future: Stored procedure for multi-item transactions**: If cross-item atomicity needed (e.g., assignment + assignment doc creation), consider stored procedure
-
-## Open Questions
-
-1. Should the spec define a maximum retry count for clients on 412?
-2. Is orphaned assignment document cleanup needed as a background job?
-3. Should the emulator path use ETag-based replace instead of unconditional replace for better parity?
diff --git a/.copilot-tracking/subagent/20260122/cosmos-indexing-research.md b/.copilot-tracking/subagent/20260122/cosmos-indexing-research.md
deleted file mode 100644
index acd2524..0000000
--- a/.copilot-tracking/subagent/20260122/cosmos-indexing-research.md
+++ /dev/null
@@ -1,259 +0,0 @@
----
-topic: cosmos-indexing
-jtbd: JTBD-008
-date: 2026-01-22
-status: complete
----
-
-# Research: Cosmos Indexing
-
-## Context
-
-The intended indexing strategy limits indexed fields to reduce write RU costs. This research examines the current Cosmos DB indexing policy, identifies the fields that queries actually use, and recommends optimizations.
-
-## Sources Consulted
-
-### Codebase
-
-- [backend/scripts/indexing-policy.json](../../../backend/scripts/indexing-policy.json): The current indexing policy configuration
-- [backend/app/adapters/repos/cosmos_repo.py](../../../backend/app/adapters/repos/cosmos_repo.py): All Cosmos DB queries and field access patterns
-- [backend/app/domain/models.py](../../../backend/app/domain/models.py): Data model field definitions
-- [backend/scripts/emulator_init.sh](../../../backend/scripts/emulator_init.sh): Container creation with indexing policy
-- [.github/workflows/gtc-cd.yml](../../../.github/workflows/gtc-cd.yml): CI/CD indexing policy application
-
-### Documentation
-
-- [Azure Cosmos DB - Indexing policies](https://learn.microsoft.com/en-us/azure/cosmos-db/index-policy): Comprehensive index configuration guide
-- [Azure Cosmos DB - Optimize request cost](https://learn.microsoft.com/en-us/azure/cosmos-db/optimize-cost-reads-writes): RU optimization best practices
-- [Azure Well-Architected Framework - Cosmos DB](https://learn.microsoft.com/en-us/azure/well-architected/service-guides/cosmos-db): Architecture recommendations
-
-## Key Findings
-
-### 1. Current Indexing Policy Uses Default "Index Everything" Strategy
-
-The current policy at [backend/scripts/indexing-policy.json](../../../backend/scripts/indexing-policy.json) indexes all paths:
-
-```json
-{
-    "indexingMode": "consistent",
-    "automatic": true,
-    "includedPaths": [{ "path": "/*" }],
-    "excludedPaths": [{ "path": "/\"_etag\"/?" }]
-}
-```
-
-**Impact**: Every field in every document is indexed, including large text fields that are never queried (e.g., `answer`, `contextUsedForGeneration`, `content` in refs).
-
-### 2. Eight Composite Indexes Defined
-
-The policy includes composite indexes for sorting operations:
-
-| Composite Index | Purpose | Used? |
-|----------------|---------|-------|
-| `[reviewedAt DESC, id ASC]` | Paginated list sorting | ✅ Yes |
-| `[updatedAt DESC, id ASC]` | Paginated list sorting | ✅ Yes |
-| `[reviewedAt ASC, id ASC]` | Ascending sort variant | ✅ Yes |
-| `[status ASC, reviewedAt DESC, id ASC]` | Filtered + sorted queries | ✅ Yes |
-| `[totalReferences ASC, id ASC]` | Reference count sorting | ✅ Yes |
-| `[totalReferences DESC, id ASC]` | Reference count sorting | ✅ Yes |
-| `[status ASC, totalReferences ASC, id ASC]` | Filtered + sorted by refs | ✅ Yes |
-| `[status ASC, totalReferences DESC, id ASC]` | Filtered + sorted by refs | ✅ Yes |
-
-All composite indexes appear to be actively used by the `_build_secure_sort_clause` method.
-
-### 3. Fields Actually Used in Queries
-
-Analysis of [cosmos_repo.py](../../../backend/app/adapters/repos/cosmos_repo.py) reveals these field access patterns:
-
-#### Filter Fields (WHERE clauses)
-
-| Field | Query Pattern | Frequency |
-|-------|--------------|-----------|
-| `docType` | Equality filter | Every query |
-| `status` | Equality filter | High |
-| `datasetName` | Equality/STARTSWITH | High |
-| `id` | Equality/STARTSWITH | High |
-| `assignedTo` | Equality/IS_NULL | Medium |
-| `manualTags` | ARRAY_CONTAINS | Medium |
-| `computedTags` | ARRAY_CONTAINS | Medium |
-| `refs[].url` | EXISTS + CONTAINS (subquery) | Low |
-| `history[].refs[].url` | EXISTS + CONTAINS (nested) | Low |
-
-#### Sort Fields (ORDER BY clauses)
-
-| Field | Direction |
-|-------|-----------|
-| `reviewedAt` | ASC, DESC |
-| `updatedAt` | DESC |
-| `totalReferences` | ASC, DESC |
-| `id` | ASC (secondary sort) |
-| `datasetName` | ASC (list_datasets) |
-
-#### Read-Only Fields (Never Filtered/Sorted)
-
-These fields are fetched but never appear in WHERE or ORDER BY:
-
-- `answer` (large text)
-- `synthQuestion`, `editedQuestion` (text)
-- `contextUsedForGeneration` (large text)
-- `contextSource`, `modelUsedForGeneration` (text)
-- `comment` (text)
-- `refs[].content` (large text, often base64-encoded)
-- `refs[].keyExcerpt`, `refs[].title` (text)
-- `history[].msg` (text)
-- `semanticClusterNumber`, `weight`, `samplingBucket`, `questionLength` (numeric)
-- `schemaVersion`, `bucket` (metadata)
-- `assignedAt`, `updatedBy` (audit fields)
-
-### 4. Partition Key Strategy
-
-The container uses MultiHash hierarchical partition key: `[/datasetName, /bucket]`
-
-**Important**: Per Microsoft documentation, partition key paths (other than `/id`) are not indexed unless an included path covers them. Under an explicit-inclusion policy they must be listed for efficient filtering queries.
-
-### 5. Full-Text Indexes Not Configured
-
-The `fullTextIndexes` array is empty. The [keyword-search-research.md](./keyword-search-research.md) recommends adding full-text indexes for `synthQuestion`, `editedQuestion`, `answer`.
-
-## Current State
-
-### Indexing Policy Summary
-
-- **Mode**: Consistent (synchronous indexing)
-- **Strategy**: Index all paths (`/*`)
-- **Exclusions**: Only `_etag`
-- **Composite indexes**: 8 defined, all actively used
-- **Full-text indexes**: None
-- **Vector indexes**: None
-
-### Estimated Storage Overhead
-
-With `/*` indexing and large text fields:
-- `answer`: Up to several KB per item
-- `contextUsedForGeneration`: Can be large
-- `refs[].content`: Often thousands of characters
-- `history[].msg`: Variable, can be large
-
-Indexing these large text fields could push index size to **50-100% or more of data size**; this is an estimate and has not been measured.
-
-## Query Analysis
-
-### Query Efficiency Assessment
-
-| Query Type | Indexed Fields Used | Efficiency |
-|-----------|---------------------|------------|
-| Paginated list | docType, status, reviewedAt | ✅ Optimal with composite |
-| Dataset filter | datasetName | ✅ Efficient |
-| ID search | id (STARTSWITH) | ✅ Efficient |
-| Tag filter | manualTags, computedTags | ⚠️ ARRAY_CONTAINS has limitations |
-| Ref URL search | refs[].url | ⚠️ EXISTS subquery, in-memory for emulator |
-| Assignment queries | status, assignedTo | ✅ Efficient |
-| Stats (count) | status | ✅ Efficient |
-
-### Fields Indexed But Never Queried
-
-These paths are indexed but provide no query benefit:
-
-1. `/answer/?` - Large text, never filtered
-2. `/synthQuestion/?` - Never filtered (could benefit from full-text)
-3. `/editedQuestion/?` - Never filtered (could benefit from full-text)
-4. `/contextUsedForGeneration/?` - Never filtered
-5. `/contextSource/?` - Never filtered
-6. `/modelUsedForGeneration/?` - Never filtered
-7. `/comment/?` - Never filtered
-8. `/refs/[]/content/?` - Never filtered
-9. `/refs/[]/keyExcerpt/?` - Never filtered
-10. `/refs/[]/title/?` - Never filtered
-11. `/history/[]/msg/?` - Never filtered
-12. `/history/[]/role/?` - Never filtered
-13. `/semanticClusterNumber/?` - Never filtered
-14. `/weight/?` - Never filtered
-15. `/samplingBucket/?` - Never filtered
-16. `/questionLength/?` - Never filtered
-17. `/schemaVersion/?` - Never filtered
-18. `/assignedAt/?` - Never filtered
-19. `/updatedBy/?` - Never filtered
-20. `/curationInstructions/?` - Never filtered
-
-## Recommendations for Spec
-
-### 1. Switch to Explicit Inclusion Strategy
-
-Instead of `/*`, explicitly include only queried paths:
-
-```json
-{
-    "indexingMode": "consistent",
-    "automatic": true,
-    "includedPaths": [
-        { "path": "/docType/?" },
-        { "path": "/status/?" },
-        { "path": "/datasetName/?" },
-        { "path": "/id/?" },
-        { "path": "/assignedTo/?" },
-        { "path": "/reviewedAt/?" },
-        { "path": "/updatedAt/?" },
-        { "path": "/totalReferences/?" },
-        { "path": "/manualTags/[]/?" },
-        { "path": "/computedTags/[]/?" },
-        { "path": "/refs/[]/url/?" }
-    ],
-    "excludedPaths": [
-        { "path": "/*" }
-    ]
-}
-```
-
-**Estimated RU savings**: 20-40% reduction in write RU costs. This is an unmeasured estimate; Microsoft documentation states only that write cost grows with the number of indexed properties.
-
-### 2. Keep Existing Composite Indexes
-
-All 8 composite indexes are actively used. No changes needed.
-
-### 3. Add Missing Index for tagCount (Future)
-
-Per [explorer-sorting-research.md](./explorer-sorting-research.md), add composite index for `tagCount` sorting when that feature is implemented.
-
-### 4. Consider Full-Text Indexes (Future)
-
-Per [keyword-search-research.md](./keyword-search-research.md), add full-text indexes when implementing search:
-
-```json
-{
-    "fullTextIndexes": [
-        { "path": "/synthQuestion" },
-        { "path": "/editedQuestion" },
-        { "path": "/answer" }
-    ]
-}
-```
-
-### 5. Monitor and Measure
-
-- Use Azure Monitor to track RU consumption before/after policy changes
-- Monitor index transformation progress during policy updates
-- Test query performance with the new policy before production deployment
-
-### 6. Implementation Approach
-
-1. **Test in emulator first**: Apply new policy to dev/test environments
-2. **Run query performance tests**: Verify all queries still perform acceptably
-3. **Apply incrementally**: Index transformation happens online but consumes RUs
-4. **Monitor transformation**: Track progress via SDK or portal
-
-## Potential RU Savings
-
-Based on Microsoft documentation:
-
-- **Write operations**: "Inserting a 1-KB item without indexing costs around 5.5 RUs. Replacing an item costs two times the charge required to insert the same item."
-- **Indexing overhead**: Each indexed property adds to write RU cost
-- **Large text fields**: Indexing multi-KB text fields significantly increases write costs
-
-**Conservative estimate**: Excluding 15-20 never-queried paths (especially large text fields) could reduce write RUs by **20-40%**.
-
-## References
-
-- [Indexing policies in Azure Cosmos DB](https://learn.microsoft.com/en-us/azure/cosmos-db/index-policy)
-- [Optimize request cost in Azure Cosmos DB](https://learn.microsoft.com/en-us/azure/cosmos-db/optimize-cost-reads-writes)
-- [Composite indexes in Azure Cosmos DB](https://learn.microsoft.com/en-us/azure/cosmos-db/index-policy#composite-indexes)
-- [SA-242 Story](https://jira.example.com/browse/SA-242)
diff --git a/.copilot-tracking/subagent/20260122/curation-editor-research.md b/.copilot-tracking/subagent/20260122/curation-editor-research.md
deleted file mode 100644
index 957f63e..0000000
--- a/.copilot-tracking/subagent/20260122/curation-editor-research.md
+++ /dev/null
@@ -1,53 +0,0 @@
----
-topic: curation-editor
-jtbd: JTBD-001
-date: 2026-01-22
-status: complete
----
-
-# Research: Curation Editor
-
-## Context
-
-The curation editor provides the main workflow to edit ground-truth content (single-turn or multi-turn), apply tags, and transition items through draft/approved/skipped/deleted states.
-
-## Sources Consulted
-
-### URLs
-- (None)
-
-### Codebase
-- [frontend/src/services/groundTruths.ts](frontend/src/services/groundTruths.ts): Maps single-turn items into a multi-turn history format and maps references across top-level and per-turn refs.
-- [frontend/src/services/tags.ts](frontend/src/services/tags.ts): Defines tag schema fetch and exclusive-group validation in the UI.
-
-### Documentation
-- [.copilot-tracking/research/20260121-high-level-requirements-research.md](.copilot-tracking/research/20260121-high-level-requirements-research.md): Consolidates editor and multi-turn behavior requirements.
-- [frontend/CODEBASE.md](frontend/CODEBASE.md): Documents the curation workspace layout and approval gating constraints.
-- [backend/CODEBASE.md](backend/CODEBASE.md): Documents API behaviors, including camelCase output and ETag concurrency.
-- [backend/docs/multi-turn-refs.md](backend/docs/multi-turn-refs.md): Documents backward-compatible storage and editing semantics for multi-turn refs.
-- [backend/docs/tagging_plan.md](backend/docs/tagging_plan.md): Documents tag normalization expectations.
-
-## Key Findings
-
-1. The UI treats all items as multi-turn in its internal model, converting legacy single-turn records into an initial two-message history.
-2. The editor supports both top-level references and per-history-turn references, and maps them into a unified reference list for user workflows.
-3. Approval is gated by reference completeness rules (at least one selected reference, all references visited, key paragraph constraints).
-4. Tagging includes manual and computed tags, and the UI enforces “exclusive group” constraints based on backend-provided schema.
-5. Documentation includes some conflicts (for example, tag write paths); when code does not reflect a doc claim, it is treated as doc-only.
-
-## Existing Patterns
-
-| Pattern | Location | Relevance |
-|---------|----------|-----------|
-| Single-turn to multi-turn normalization | [frontend/src/services/groundTruths.ts](frontend/src/services/groundTruths.ts) | Defines current UI behavior and backward compatibility |
-| Exclusive tag group validation | [frontend/src/services/tags.ts](frontend/src/services/tags.ts) | Defines validation expectations for tag selection |
-
-## Open Questions
-
-- (None)
-
-## Recommendations for Spec
-
-- Specify the multi-turn normalization rule as a frontend behavior and compatibility expectation.
-- Specify tag behaviors in terms of observable constraints (exclusive groups, manual vs computed sets).
-- Specify approval gating rules as UX invariants.
diff --git a/.copilot-tracking/subagent/20260122/data-persistence-research.md b/.copilot-tracking/subagent/20260122/data-persistence-research.md
deleted file mode 100644
index 1e56daf..0000000
--- a/.copilot-tracking/subagent/20260122/data-persistence-research.md
+++ /dev/null
@@ -1,53 +0,0 @@
----
-topic: data-persistence
-jtbd: JTBD-001
-date: 2026-01-22
-status: complete
----
-
-# Research: Data Persistence
-
-## Context
-
-The persistence layer abstracts storage behind a repository protocol with Azure Cosmos DB as the primary backend.
-
-## Sources Consulted
-
-### URLs
-- (None)
-
-### Codebase
-- [backend/app/adapters/repos/base.py](backend/app/adapters/repos/base.py): Defines the `GroundTruthRepo` protocol that abstracts storage operations.
-- [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py): Implements the Cosmos DB repository.
-- [backend/app/main.py](backend/app/main.py): Shows lifespan initialization for Cosmos repo; does not block startup on failure.
-
-### Documentation
-- [.copilot-tracking/research/20260121-high-level-requirements-research.md](.copilot-tracking/research/20260121-high-level-requirements-research.md): Consolidates persistence and Cosmos emulator requirements.
-- [backend/CODEBASE.md](backend/CODEBASE.md): Documents layered architecture and configuration for Cosmos.
-- [backend/docs/cosmos-emulator-limitations.md](backend/docs/cosmos-emulator-limitations.md): Documents emulator query limitations and test gating.
-- [backend/docs/cosmos-emulator-unicode-workaround.md](backend/docs/cosmos-emulator-unicode-workaround.md): Documents optional Unicode escape workaround.
-
-## Key Findings
-
-1. The backend defines a `GroundTruthRepo` protocol to abstract storage, enabling in-memory and Cosmos backends.
-2. The Cosmos implementation is the production backend and is initialized during app lifespan.
-3. Startup does not block if Cosmos initialization fails; this supports emulator-not-ready scenarios.
-4. The Cosmos emulator has query limitations (for example, lack of `ARRAY_CONTAINS`), and incompatible tests are gated/skipped.
-5. An optional Unicode escape workaround exists for emulator-only invalid escape failures.
-
-## Existing Patterns
-
-| Pattern | Location | Relevance |
-|---------|----------|-----------|
-| Repository protocol abstraction | [backend/app/adapters/repos/base.py](backend/app/adapters/repos/base.py) | Defines interface for pluggable storage |
-| Non-blocking lifespan init | [backend/app/main.py](backend/app/main.py) | Supports graceful degradation when emulator is unavailable |
-
-## Open Questions
-
-- (None)
-
-## Recommendations for Spec
-
-- Specify that storage is abstracted via a repository protocol with Cosmos as the primary backend.
-- Specify non-blocking startup behavior when Cosmos is unavailable.
-- Specify that emulator-incompatible behaviors are gated or skipped in tests.
diff --git a/.copilot-tracking/subagent/20260122/dependency-injection-research.md b/.copilot-tracking/subagent/20260122/dependency-injection-research.md
deleted file mode 100644
index 278d74d..0000000
--- a/.copilot-tracking/subagent/20260122/dependency-injection-research.md
+++ /dev/null
@@ -1,260 +0,0 @@
-# Dependency Injection Research: SA-238
-
-**Research Date:** 2026-01-22
-**Topic:** Refactoring to use FastAPI dependency injection for config and cosmos
-
----
-
-## 1. Current Architecture Analysis
-
-### 1.1 Container.py Overview
-
-The [container.py](backend/app/container.py) file implements a **Service Locator** pattern (not true DI):
-
-```python
-class Container:
-    repo: GroundTruthRepo
-    assignment_service: AssignmentService
-    search_service: SearchService
-    snapshot_service: SnapshotService
-    curation_service: CurationService
-    tag_registry_service: TagRegistryService
-    # ... more services
-
-container = Container()  # Global singleton
-```
-
-**Key characteristics:**
-
-- Single global `container` instance created at module import time
-- Services initialized lazily via explicit `init_*()` methods
-- Cosmos repo created via `init_cosmos_repo(db_name)` or `startup_cosmos(db_name)`
-- Services store direct references to other services and repos
-
-### 1.2 Service Instantiation Flow
-
-1. **App startup** ([main.py](backend/app/main.py#L60-L78)):
-   - `lifespan()` async context manager calls `container.startup_cosmos()`
-   - This creates repo instances and wires services
-
-2. **Container initialization methods**:
-   - `init_cosmos_repo()` - Creates Cosmos repo and dependent services
-   - `init_search()` - Configures Azure AI Search adapter
-   - `init_chat()` - Configures agent inference service
-
-### 1.3 Endpoint Access Pattern
-
-Endpoints access services via **direct module import** of the global container:
-
-```python
-# In every API router file
-from app.container import container
-
-@router.post("")
-async def import_bulk(...):
-    result = await container.repo.import_bulk_gt(gt_items, buckets=buckets)
-```
-
-This pattern repeats across all 16+ files that import `container`.
-
----
-
-## 2. Existing FastAPI `Depends()` Usage
-
-The codebase **already uses** `Depends()` extensively for authentication:
-
-| File | Usage Pattern |
-|------|---------------|
-| [ground_truths.py](backend/app/api/v1/ground_truths.py) | `user: UserContext = Depends(get_current_user)` |
-| [assignments.py](backend/app/api/v1/assignments.py) | `user: UserContext = Depends(get_current_user)` |
-| [search.py](backend/app/api/v1/search.py) | `user: UserContext = Depends(get_current_user)` |
-| [chat.py](backend/app/api/v1/chat.py) | `principal: Principal = Depends(require_user)` |
-| [main.py](backend/app/main.py#L181) | `dependencies=[Depends(require_user)]` on routes |
-
-**24+ usages** of `Depends()` found, all for authentication.
-
-**No services** are currently injected via `Depends()`.
-
----
-
-## 3. Configuration Access Pattern
-
-### 3.1 Settings Module ([config.py](backend/app/core/config.py))
-
-Configuration uses **Pydantic Settings** with a global singleton:
-
-```python
-class Settings(BaseSettings):
-    model_config = SettingsConfigDict(env_prefix="GTC_", ...)
-    
-    COSMOS_ENDPOINT: str | None = None
-    COSMOS_KEY: SecretStr | None = None
-    # ... 60+ settings
-
-settings = Settings()  # Global singleton
-```
-
-### 3.2 Settings Access
-
-Settings are accessed via direct import throughout:
-
-```python
-from app.core.config import settings
-
-# Container uses it
-if settings.COSMOS_ENDPOINT:
-    ...
-    
-# Services use it
-if settings.CHAT_ENABLED:
-    ...
-```
-
----
-
-## 4. Pain Points Identified
-
-### 4.1 Testing Complexity
-
-**Integration tests** require extensive fixtures to manage container state:
-
-From [tests/integration/conftest.py](backend/tests/integration/conftest.py#L87-L124):
-
-```python
-@pytest.fixture(scope="function")
-async def configure_repo_for_test_db(require_cosmos_backend, test_db_name, init_emulator_containers):
-    # Close any previous Cosmos async client
-    try:
-        prev_repo = getattr(container, "repo", None)
-        client = getattr(prev_repo, "_client", None)
-        if client is not None:
-            # Manual cleanup...
-    except Exception:
-        pass
-    container.init_cosmos_repo(db_name=test_db_name)
-```
-
-**Unit tests** create fake repos and directly mutate container:
-
-From [tests/unit/conftest.py](backend/tests/unit/conftest.py#L59-L130):
-
-```python
-container.repo = _NoopMemoryRepo()
-container.assignment_service = AssignmentService(container.repo)
-container.snapshot_service = SnapshotService(container.repo, ...)
-# ... manual wiring of all services
-```
-
-### 4.2 Service Coupling
-
-The [validation_service.py](backend/app/services/validation_service.py) directly imports container:
-
-```python
-from app.container import container
-
-async def validate_ground_truth_item(item, valid_tags_cache=None):
-    if valid_tags_cache is None:
-        valid_tags_cache = set(await container.tag_registry_service.list_tags())
-```
-
-This creates a **hidden dependency** that's hard to mock without modifying the global container.
-
-### 4.3 Async Initialization Complexity
-
-Container uses `cast(ServiceType, None)` as placeholder until async init:
-
-```python
-self.repo = cast(GroundTruthRepo, None)
-self.assignment_service = cast(AssignmentService, None)
-```
-
-If a service is used before its `init_*()` method runs, the access hits `None` at runtime while the type checker still sees a valid service.
-
----
-
-## 5. What FastAPI DI Would Provide
-
-### 5.1 Benefits
-
-| Current Approach | FastAPI DI Alternative |
-|------------------|------------------------|
-| Global mutable singleton | Request-scoped or cached dependencies |
-| Manual container wiring in tests | `app.dependency_overrides[dep] = mock` |
-| Import-time coupling | Runtime injection |
-| Settings imported as a global singleton | `Annotated[Settings, Depends(get_settings)]` |
-
-### 5.2 Example Transformation
-
-**Current:**
-```python
-from app.container import container
-
-@router.post("")
-async def import_bulk(items: list[GroundTruthItem]):
-    result = await container.repo.import_bulk_gt(items)
-```
-
-**With FastAPI DI:**
-```python
-def get_repo() -> GroundTruthRepo:
-    return container.repo  # Or create fresh
-
-@router.post("")
-async def import_bulk(
-    items: list[GroundTruthItem],
-    repo: GroundTruthRepo = Depends(get_repo)
-):
-    result = await repo.import_bulk_gt(items)
-```
-
-**Test override:**
-```python
-async def test_import():
-    app.dependency_overrides[get_repo] = lambda: MockRepo()
-    # The test now uses MockRepo without touching the global container; clear app.dependency_overrides afterwards
-```
-
----
-
-## 6. Assessment
-
-### 6.1 Current Approach Works
-
-The current Service Locator pattern is:
-
-- **Consistent** - Used uniformly across all endpoints
-- **Simple** - One import gives access to all services
-- **Tested** - Extensive test coverage exists
-- **Functional** - No reported bugs related to DI
-
-### 6.2 Migration Complexity
-
-A full FastAPI DI migration would require:
-
-1. Creating `Depends()` functions for each service (~8 services)
-2. Updating all endpoint signatures (~50+ endpoints)
-3. Rewriting test fixtures to use `dependency_overrides`
-4. Managing async initialization differently (lifespan vs per-request)
-
-### 6.3 Recommendation
-
-**Status: Consider deferring or partial adoption**
-
-The current approach is working. Potential improvements without full migration:
-
-1. **Partial adoption**: Use `Depends()` for new endpoints
-2. **Settings injection**: Create `get_settings()` dependency for easier testing
-3. **Service injection for validation_service**: Remove direct container import
-
----
-
-## 7. Summary
-
-| Question | Finding |
-|----------|---------|
-| What does container.py do? | Service Locator with lazy initialization, holds all service singletons |
-| How are services accessed? | Direct import of global `container` instance |
-| What config objects exist? | Single `Settings` Pydantic model, global `settings` instance |
-| Pain points? | Test complexity, service coupling, async init management |
-| FastAPI DI already used? | Yes, but only for auth (`get_current_user`, `require_user`) |
-| Migration worth it? | Partial adoption may be sufficient; full migration is high effort |
diff --git a/.copilot-tracking/subagent/20260122/docs-content-strategy-research.md b/.copilot-tracking/subagent/20260122/docs-content-strategy-research.md
deleted file mode 100644
index 0e9a776..0000000
--- a/.copilot-tracking/subagent/20260122/docs-content-strategy-research.md
+++ /dev/null
@@ -1,250 +0,0 @@
-# Documentation Content Strategy Research
-
-## Overview
-
-This research assesses the current documentation landscape for Ground Truth Curator, identifying audience fit, staleness, and organization recommendations.
-
----
-
-## 1. Documentation Inventory
-
-### Root Level
-
-| File | Audience | Status | Notes |
-|------|----------|--------|-------|
-| [README.md](README.md) | Developers | **Stub** | Single-line placeholder only |
-| [AGENTS.md](AGENTS.md) | AI Agents | Current | Jujutsu workflow instructions |
-| [BUSINESS_VALUE.md](BUSINESS_VALUE.md) | Stakeholders/SMEs | Current | Value proposition and KPIs |
-
-### Backend (`backend/`)
-
-| File | Audience | Status | Notes |
-|------|----------|--------|-------|
-| [README.md](backend/README.md) | Developers | **Current** | Comprehensive local setup guide |
-| [CODEBASE.md](backend/CODEBASE.md) | Developers | **Current** | Architecture map, contracts, extension points |
-
-#### Backend Docs (`backend/docs/`)
-
-| File | Audience | Status | Notes |
-|------|----------|--------|-------|
-| export-pipeline.md | Developers | **Current** | Export API and storage backends |
-| OBSERVABILITY_IMPLEMENTATION.md | Developers/Ops | Current | Telemetry setup |
-| api-write-consolidation-plan.md | Developers | **Stale/Plan** | AI-generated implementation plan |
-| api-write-consolidation-plan.v2.md | Developers | **Stale/Plan** | Superseded plan version |
-| fastapi-implementation-plan.md | Developers | **Stale/Plan** | Original MVP implementation plan |
-| drift_cleanup.md | Developers | **Stale/Plan** | API drift analysis (completed work) |
-| tagging_plan.md | Developers | Partially current | Tag behavior reference |
-| cosmos-emulator-limitations.md | Developers | Current | Emulator workarounds |
-| cosmos-emulator-unicode-workaround.md | Developers | Current | Unicode escape fix |
-| todos.md | Developers | **Stale** | Old MVP checklist |
-| multi-turn-refs.md | Developers | Current | Multi-turn data model |
-| history-tags-feature.md | Developers | Current | History item tags |
-| user-self-serve-plan.md | Developers | **Stale/Plan** | Implemented feature |
-| assign-single-item-endpoint.md | Developers | **Stale/Plan** | Endpoint design doc |
-| pytest-fastapi-cosmos-emulator-best-practices.md | Developers | Current | Testing guidance |
-
-### Frontend (`frontend/`)
-
-| File | Audience | Status | Notes |
-|------|----------|--------|-------|
-| [README.md](frontend/README.md) | Developers | **Current** | Local dev guide |
-| [CODEBASE.md](frontend/CODEBASE.md) | Developers | **Current** | Architecture map and contracts |
-
-#### Frontend Docs (`frontend/docs/`)
-
-| File | Audience | Status | Notes |
-|------|----------|--------|-------|
-| CONNECT_TO_BACKEND.md | Developers | Current | API types generation guide |
-| MVP_REQUIREMENTS.md | Developers/SMEs | **Partially stale** | Original MVP checklist (some items done) |
-| REFACTORING_PLAN.md | Developers | **Stale/Plan** | Completed refactor |
-| OBSERVABILITY_IMPLEMENTATION.md | Developers | Current | Frontend telemetry |
-| connecting-e2e-best-practices.md | Developers | Current | E2E testing patterns |
-
-#### Frontend Plans (`frontend/plans/`)
-
-| File | Audience | Status | Notes |
-|------|----------|--------|-------|
-| multi-turn-curation-plan.md | Developers | **Stale/Plan** | Implementation plan (in progress) |
-| e2e-backend-integration-plan.md | Developers | **Stale/Plan** | Completed integration |
-| playwright-e2e-test-plan.md | Developers | **Stale/Plan** | Test setup plan |
-| keyboard-shortcuts-plan.md | Developers | **Stale/Plan** | Implemented feature |
-| agent-integration-plan.md | Developers | **Stale/Plan** | LLM integration plan |
-| telemetry-observability-plan.md | Developers | **Stale/Plan** | Implemented feature |
-| *-plan.md (remaining) | Developers | **Stale/Plan** | Various implementation plans |
-
-### Docs Folder (`docs/`)
-
-| File | Audience | Status | Notes |
-|------|----------|--------|-------|
-| ground-truth-curation-reqs.md | Developers/SMEs | **Canonical** | MVP requirements and data model |
-| computed-tags-design.md | Developers | **Current** | Tag architecture and export pipeline |
-| manual-tags-design.md | Developers | Current | Manual tag system |
-| frontend-runtime-configuration.md | Developers | Current | Runtime config |
-| json-export-migration-plan.md | Developers | **Stale/Plan** | Completed migration |
-
-### Specs Folder (`specs/`)
-
-| File | Audience | Status | Notes |
-|------|----------|--------|-------|
-| _index.md | All | **Current** | Spec index by JTBD |
-| assignment-workflow.md | Developers/SMEs | Draft | Current-state spec |
-| explorer-view.md | Developers/SMEs | Draft | Current-state spec |
-| curation-editor.md | Developers/SMEs | Draft | Current-state spec |
-| reference-management.md | Developers/SMEs | Draft | Current-state spec |
-| export-snapshots.md | Developers/SMEs | Draft | Current-state spec |
-| data-persistence.md | Developers | Draft | Cosmos backend spec |
-| observability-operations.md | Developers/Ops | Draft | Health and telemetry spec |
-| *-enhancement specs | Developers | Draft | Future feature specs |
-
----
-
-## 2. Staleness Assessment
-
-### Categories
-
-**Current (Authoritative)**
-- Backend README.md and CODEBASE.md
-- Frontend README.md and CODEBASE.md
-- Export pipeline docs
-- Emulator workarounds
-- Testing best practices
-- Specs index and current-state specs
-
-**Stale/Plan Documents (AI-generated or completed work)**
-- `backend/docs/fastapi-implementation-plan.md` - original MVP plan, now implemented
-- `backend/docs/api-write-consolidation-plan*.md` - API redesign, mostly complete
-- `backend/docs/drift_cleanup.md` - analysis of completed cleanup
-- `backend/docs/user-self-serve-plan.md` - implemented
-- `backend/docs/todos.md` - outdated checklist
-- `frontend/docs/REFACTORING_PLAN.md` - completed refactor
-- `frontend/plans/*.md` - most are completed implementation plans
-- `docs/json-export-migration-plan.md` - completed migration
-
-**Partially Stale**
-- `frontend/docs/MVP_REQUIREMENTS.md` - contains done items mixed with remaining work
-- `docs/ground-truth-curation-reqs.md` - canonical but has outdated "todo" items
-
-### Drift Patterns
-
-1. **AI-generated plans remain after implementation** - Plans in `frontend/plans/` and `backend/docs/` were created to guide implementation but weren't archived after completion.
-
-2. **Checklists not updated** - MVP_REQUIREMENTS.md and todos.md have checkboxes that don't reflect current state.
-
-3. **Multiple versions** - api-write-consolidation-plan.md has v1 and v2 without clear indication which is canonical.
-
----
-
-## 3. Audience Analysis
-
-### Developer Audience
-
-**Well served by:**
-- Backend/frontend README.md - local setup
-- Backend/frontend CODEBASE.md - architecture understanding
-- Export pipeline and emulator docs - specific technical guidance
-- Specs folder - system behavior documentation
-
-**Gaps:**
-- No consolidated "Getting Started" guide across the full stack
-- No API reference (relies on OpenAPI spec)
-- No contribution guide
-- Architecture diagrams scattered or missing
-
-### SME/Curator Audience
-
-**Well served by:**
-- BUSINESS_VALUE.md - value proposition
-- ground-truth-curation-reqs.md - requirements context
-- Current-state specs - system behavior documentation
-
-**Gaps:**
-- **No user guide** - SMEs have no documentation for using the curation UI
-- **No workflow guide** - No step-by-step curation workflow documentation
-- **No onboarding material** - New SMEs must learn by exploration
-
-### Ops/Admin Audience
-
-**Partially served by:**
-- Observability implementation docs
-- Backend README deployment section
-
-**Gaps:**
-- No runbook for production operations
-- No incident response documentation
-- Limited deployment documentation
-
----
-
-## 4. Content Organization Recommendations
-
-### Recommended Structure
-
-```
-docs/
-├── README.md                    # Documentation hub (NEW)
-├── getting-started/
-│   ├── quickstart.md           # Full-stack setup (NEW)
-│   ├── developer-setup.md      # Detailed dev environment
-│   └── sme-onboarding.md       # SME getting started (NEW)
-├── user-guides/
-│   ├── curation-workflow.md    # SME curation guide (NEW)
-│   ├── tagging-guide.md        # How to use tags (NEW)
-│   └── export-guide.md         # Export procedures (NEW)
-├── architecture/
-│   ├── overview.md             # System architecture (NEW)
-│   ├── data-model.md           # Consolidated from reqs
-│   ├── api-reference.md        # Link to OpenAPI
-│   └── backend-internals.md    # From CODEBASE.md
-├── operations/
-│   ├── deployment.md           # Deploy to Azure (NEW)
-│   ├── monitoring.md           # Observability guide
-│   └── troubleshooting.md      # Common issues (NEW)
-├── contributing/
-│   ├── CONTRIBUTING.md         # Contribution guide (NEW)
-│   └── code-conventions.md     # From specs
-└── archive/
-    └── plans/                  # Move completed plans here
-```
-
-### Migration Actions
-
-1. **Create docs hub** - New README.md in docs/ with navigation
-
-2. **Create SME documentation** - Priority: curation-workflow.md and sme-onboarding.md
-
-3. **Archive stale plans** - Move completed implementation plans to `docs/archive/plans/`
-
-4. **Consolidate duplicates** - Merge api-write-consolidation-plan versions
-
-5. **Update checklists** - Either update or archive MVP_REQUIREMENTS.md and todos.md
-
-6. **Promote specs** - Current-state specs are good; link from docs hub
-
----
-
-## 5. Summary
-
-### Current State
-
-| Category | Count | Status |
-|----------|-------|--------|
-| Current/authoritative docs | 15 | Good coverage for developers |
-| Stale plan documents | 12+ | Need archival |
-| SME-focused docs | 0 | **Critical gap** |
-| Ops documentation | 2 | Partial coverage |
-
-### Priorities
-
-1. **High: Create SME user guide** - No documentation for the primary user persona
-2. **High: Archive stale plans** - Reduce confusion about authoritative sources
-3. **Medium: Create docs hub** - Improve discoverability
-4. **Medium: Getting started guide** - Reduce onboarding friction
-5. **Low: Ops runbook** - Needed for production but can follow launch
-
-### Key Findings
-
-- **Developer docs are strong** - README and CODEBASE files provide good guidance
-- **SME docs are absent** - Critical gap for the primary user audience
-- **Plan documents create noise** - 12+ stale plans remain in active locations
-- **Specs are well-organized** - JTBD-based spec structure is effective
-- **No contribution guide** - Missing standard OSS documentation
diff --git a/.copilot-tracking/subagent/20260122/docs-infrastructure-research.md b/.copilot-tracking/subagent/20260122/docs-infrastructure-research.md
deleted file mode 100644
index 006543d..0000000
--- a/.copilot-tracking/subagent/20260122/docs-infrastructure-research.md
+++ /dev/null
@@ -1,167 +0,0 @@
----
-title: Documentation Infrastructure Research
-description: Research findings on current documentation state and MkDocs setup requirements
-author: copilot
-ms.date: 2026-01-22
-status: complete
----
-
-## Summary
-
-The repository has **no existing MkDocs configuration**. Documentation is scattered across multiple locations with no unified build system. Setting up MkDocs requires creating the configuration from scratch.
-
-## Research Findings
-
-### 1. Existing Documentation Files
-
-**Root-level documentation:**
-
-| File | Purpose |
-|------|---------|
-| [README.md](../../../README.md) | Minimal project title only |
-| [AGENTS.md](../../../AGENTS.md) | Jujutsu version control workflow instructions |
-| [BUSINESS_VALUE.md](../../../BUSINESS_VALUE.md) | Business value documentation |
-
-**`docs/` folder (5 files + 2 subfolders):**
-
-| File | Description |
-|------|-------------|
-| computed-tags-design.md | Tag computation design |
-| manual-tags-design.md | Manual tagging design |
-| frontend-runtime-configuration.md | Frontend config guide |
-| ground-truth-curation-reqs.md | Requirements document |
-| json-export-migration-plan.md | Export migration plan |
-| images/ | Image assets |
-| specs/ | Empty subfolder |
-
-**`specs/` folder (26 specification files):**
-
-Organized specifications with an `_index.md` index file covering:
-
-- JTBD-001: Current-state system specs (7 topics)
-- JTBD-002: Curation enhancements (7 topics)
-- JTBD-003: Search and filtering (3 topics)
-- JTBD-004: Data integrity and security (4 topics)
-- JTBD-005: Code quality (4 topics)
-
-**`backend/docs/` folder (17 files):**
-
-Technical documentation including:
-
-- API change checklists and consolidation plans
-- Cosmos emulator documentation and workarounds
-- Feature plans (tagging, history, multi-turn refs)
-- Best practices guides
-
-**`frontend/docs/` folder (5 files):**
-
-- CONNECT_TO_BACKEND.md
-- MVP_REQUIREMENTS.md
-- OBSERVABILITY_IMPLEMENTATION.md
-- REFACTORING_PLAN.md
-- connecting-e2e-best-practices.md
-
-**Component READMEs:**
-
-- [backend/README.md](../../../backend/README.md) - Comprehensive setup guide (~300 lines)
-- [frontend/README.md](../../../frontend/README.md) - Development guide (~100 lines)
-- backend/scripts/README.md
-- scripts/README.md
-
-### 2. MkDocs Configuration Status
-
-**No `mkdocs.yml` exists.** File search returned no results.
-
-### 3. Existing Build Tooling
-
-**Root level:** No package.json exists at repository root.
-
-**`backend/pyproject.toml`:**
-
-- Uses `uv` for package management
-- No documentation-related scripts or dependencies
-- Dependencies: FastAPI, pytest, ruff, black (no mkdocs/sphinx)
-
-**`frontend/package.json`:**
-
-- Standard Vite/React scripts (dev, build, lint, test)
-- No documentation scripts
-- No documentation dependencies
-
-### 4. Documentation Structure Assessment
-
-| Location | File Count | Content Type |
-|----------|------------|--------------|
-| Root | 3 | Project overview |
-| docs/ | 5 | Design docs, requirements |
-| specs/ | 26 | Feature specifications |
-| backend/docs/ | 17 | Technical guides |
-| frontend/docs/ | 5 | Frontend guides |
-| .copilot-tracking/ | 50+ | Research artifacts |
-
-**Total unique documentation files:** ~106 markdown files
-
-## What Needs to Be Set Up
-
-### Required for MkDocs
-
-1. **Create `mkdocs.yml`** at repository root with:
-   - Site metadata (name, description, repo URL)
-   - Theme configuration (recommend Material for MkDocs)
-   - Navigation structure organizing scattered docs
-   - Plugin configuration (search, etc.)
-
-2. **Add MkDocs dependencies** to `backend/pyproject.toml`:
-
-   ```toml
-   [project.optional-dependencies]
-   docs = [
-       "mkdocs>=1.6",
-       "mkdocs-material>=9.5",
-   ]
-   ```
-
-3. **Create navigation structure** to unify:
-   - Root README as landing page
-   - `docs/` as design documentation
-   - `specs/` as specifications section
-   - `backend/docs/` as backend technical docs
-   - `frontend/docs/` as frontend technical docs
-   - Component READMEs as quickstart guides
-
-4. **Add scripts** for build/serve:
-   - `uv run mkdocs serve` for local development
-   - `uv run mkdocs build` for static site generation
-
-### Recommended Navigation Structure
-
-```yaml
-nav:
-  - Home: index.md
-  - Getting Started:
-    - Backend Setup: backend/README.md
-    - Frontend Setup: frontend/README.md
-  - Specifications:
-    - Overview: specs/_index.md
-    - Current State: specs/assignment-workflow.md
-    # ... other specs
-  - Design Docs:
-    - Tags Design: docs/manual-tags-design.md
-    # ... other design docs
-  - Backend Reference:
-    - API Plans: backend/docs/api-write-consolidation-plan.md
-    # ... other backend docs
-  - Frontend Reference:
-    - Connect to Backend: frontend/docs/CONNECT_TO_BACKEND.md
-    # ... other frontend docs
-```
-
-## Key Findings Summary
-
-| Question | Answer |
-|----------|--------|
-| MkDocs configuration exists? | **No** |
-| Documentation build tooling? | **None** |
-| Documentation locations | 5+ scattered locations |
-| Total markdown files | ~106 |
-| Setup complexity | Medium (organize existing content) |
diff --git a/.copilot-tracking/subagent/20260122/dos-prevention-research.md b/.copilot-tracking/subagent/20260122/dos-prevention-research.md
deleted file mode 100644
index 38e33b0..0000000
--- a/.copilot-tracking/subagent/20260122/dos-prevention-research.md
+++ /dev/null
@@ -1,160 +0,0 @@
-# DoS Prevention Research: Bulk Import Endpoint
-
-**Date:** 2026-01-22  
-**Story:** SA-409  
-**Topic:** DoS vulnerability in bulk import endpoint
-
-## Executive Summary
-
-The bulk import endpoint (`POST /v1/ground_truths`) accepts an unbounded list of `GroundTruthItem` objects with **no size validation**. This creates a critical DoS vulnerability where attackers can exhaust server memory/CPU by submitting arbitrarily large payloads. No rate limiting middleware exists in the codebase.
-
-## Research Findings
-
-### 1. Current Bulk Import Endpoint
-
-**Location:** [backend/app/api/v1/ground_truths.py](backend/app/api/v1/ground_truths.py#L55-L119)
-
-```python
-@router.post("", response_model=ImportBulkResponse)
-async def import_bulk(
-    items: list[GroundTruthItem],  # ← NO SIZE LIMIT
-    user: UserContext = Depends(get_current_user),
-    buckets: int | None = Query(default=None, ge=1, le=50),
-    approve: bool = Query(
-        default=False,
-        description="If true, mark all imported items as approved and set review metadata.",
-    ),
-) -> ImportBulkResponse:
-```
-
-**Confirmed gaps:**
-
-- No `max_length` constraint on the `items` list parameter
-- No validation of list size before processing
-- No request body size limit configured
-- Iterates over entire list twice (ID assignment + validation) before any persistence
-
-### 2. Rate Limiting Libraries for FastAPI
-
-| Library | Description | Pros | Cons |
-|---------|-------------|------|------|
-| **slowapi** | FastAPI-friendly, based on limits | Drop-in, Redis support, decorator-based | Adds dependency |
-| **fastapi-limiter** | Redis-based rate limiting | Async-native | Requires Redis |
-| **starlette-throttle** | Starlette middleware | Simple | Less maintained |
-| **Custom middleware** | Roll your own | Full control, no deps | More code to maintain |
-
-**Recommendation:** `slowapi` - mature, FastAPI-native, supports memory and Redis backends.
-
-### 3. Configuration Patterns in GTC
-
-**Settings location:** [backend/app/core/config.py](backend/app/core/config.py)
-
-The codebase uses `pydantic-settings` with:
-
-- Environment variable prefix: `GTC_`
-- Type-safe settings via `Settings` class
-- Field validation with `Field()` and `model_validator`
-
-**Existing pagination settings pattern to follow:**
-
-```python
-# Pagination settings
-PAGINATION_MAX_LIMIT: int = Field(
-    default=100, description="Maximum items per page for list queries"
-)
-PAGINATION_MIN_LIMIT: int = Field(default=1, description="Minimum items per page")
-PAGINATION_TAG_FETCH_MAX: int = Field(
-    default=500,
-    description="Maximum items to fetch for tag filtering queries (memory safeguard)",
-)
-```
-
-**Recommended new settings:**
-
-```python
-# DoS prevention settings
-BULK_IMPORT_MAX_ITEMS: int = Field(
-    default=1000, description="Maximum items per bulk import request"
-)
-RATE_LIMIT_REQUESTS: int = Field(
-    default=100, description="Rate limit: requests per window"
-)
-RATE_LIMIT_WINDOW_SECONDS: int = Field(
-    default=60, description="Rate limit window in seconds"
-)
-```
-
-### 4. Existing Security Middleware
-
-**Location:** [backend/app/main.py](backend/app/main.py)
-
-Current middleware stack:
-
-1. **Easy Auth middleware** (`install_ezauth_middleware`) - Authentication via Azure Container Apps
-2. **User logging middleware** (`user_logging_middleware`) - Request logging with user context
-
-**No existing:**
-
-- Rate limiting middleware
-- Request body size validation
-- DoS protection middleware
-
-**CORS note:** CORS is handled at platform level (Azure Container Apps), not in code.
-
-### 5. Request Body Size
-
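-By default, neither FastAPI/Starlette nor Uvicorn limits request body size. This should be addressed at multiple levels:
-
-- Application level: Validate list length in endpoint
-- Server level: Uvicorn has no CLI option for capping body size (`--limit-max-body-size` is not a Uvicorn flag); enforce a byte limit in ASGI middleware (for example, by checking `Content-Length`) or at a reverse proxy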
-- Platform level: Azure Container Apps ingress limits
-
-## Gap Analysis
-
-| Control | Current State | Required |
-|---------|--------------|----------|
-| Batch size limit | ❌ None | ✅ Configurable max items |
-| Rate limiting | ❌ None | ✅ Per-user/IP throttling |
-| Request body size | ❌ Unlimited | ✅ Configurable max bytes |
-| Validation before processing | ⚠️ Partial | ✅ Early rejection |
-
-## Recommended Implementation
-
-### Phase 1: Immediate (Batch Size Limit)
-
-1. Add `BULK_IMPORT_MAX_ITEMS` to `Settings` class
-2. Add validation at start of `import_bulk`:
-
-```python
-if len(items) > settings.BULK_IMPORT_MAX_ITEMS:
-    raise HTTPException(
-        status_code=400,
-        detail=f"Batch size {len(items)} exceeds maximum of {settings.BULK_IMPORT_MAX_ITEMS}"
-    )
-```
-
-### Phase 2: Rate Limiting
-
-1. Add `slowapi` dependency to `pyproject.toml`
-2. Configure rate limiter in `main.py`
-3. Apply rate limit decorator to bulk endpoints
-
-### Phase 3: Server-Level Protection
-
-1. Enforce a request body size cap in ASGI middleware or at a reverse proxy (Uvicorn has no built-in body size flag)
-2. Review Azure Container Apps ingress settings
-
-## Files to Modify
-
-| File | Change |
-|------|--------|
-| `backend/app/core/config.py` | Add DoS prevention settings |
-| `backend/app/api/v1/ground_truths.py` | Add batch size validation |
-| `backend/pyproject.toml` | Add slowapi dependency (Phase 2) |
-| `backend/app/main.py` | Install rate limiting middleware (Phase 2) |
-
-## References
-
-- [slowapi documentation](https://github.com/laurents/slowapi)
-- [FastAPI request body size](https://fastapi.tiangolo.com/advanced/request-body/)
-- [OWASP DoS Prevention](https://owasp.org/www-community/attacks/Denial_of_Service)
diff --git a/.copilot-tracking/subagent/20260122/draft-duplicate-detection-research.md b/.copilot-tracking/subagent/20260122/draft-duplicate-detection-research.md
deleted file mode 100644
index 7226d03..0000000
--- a/.copilot-tracking/subagent/20260122/draft-duplicate-detection-research.md
+++ /dev/null
@@ -1,287 +0,0 @@
-# Draft Duplicate Detection Research
-
-**Date:** 2026-01-22
-**Topic:** Draft duplicate detection system for warning SMEs about potential duplicates
-
----
-
-## Research Questions and Findings
-
-### 1. Data Model for Ground Truth Items (Draft vs Approved Status)
-
-**Backend Model:** [backend/app/domain/models.py](backend/app/domain/models.py)
-
-The `GroundTruthItem` class defines the core data model:
-
-```python
-class GroundTruthItem(BaseModel):
-    id: str
-    datasetName: str
-    bucket: Optional[UUID] = None
-    status: GroundTruthStatus = GroundTruthStatus.draft  # Default is draft
-    docType: str = "ground-truth-item"
-    schemaVersion: str = "v2"
-    
-    # Question/Answer fields
-    synth_question: str = Field(alias="synthQuestion")  # Original synthesized question
-    edited_question: Optional[str] = Field(default=None, alias="editedQuestion")  # User-edited version
-    answer: Optional[str] = None
-    refs: list[Reference] = []
-    
-    # Multi-turn support
-    history: Optional[list[HistoryItem]] = None
-    
-    # Tags
-    manual_tags: list[str] = []
-    computed_tags: list[str] = []
-```
-
-**Status Enum:** [backend/app/domain/enums.py](backend/app/domain/enums.py)
-
-```python
-class GroundTruthStatus(str, Enum):
-    draft = "draft"
-    approved = "approved"
-    deleted = "deleted"
-    skipped = "skipped"
-```
-
-**Frontend Model:** [frontend/src/models/groundTruth.ts](frontend/src/models/groundTruth.ts)
-
-```typescript
-export type GroundTruthItem = {
-    id: string;
-    question: string;  // Maps to editedQuestion or synthQuestion
-    answer: string;
-    history?: ConversationTurn[];
-    references: Reference[];
-    status: "draft" | "approved" | "skipped" | "deleted";
-    deleted?: boolean;  // Soft delete flag
-    // ...
-};
-```
-
----
-
-### 2. Fields for Duplicate Comparison
-
-**Primary Comparison Candidates:**
-
-| Field | Backend Name | Frontend Name | Notes |
-|-------|-------------|---------------|-------|
-| Original Question | `synthQuestion` | N/A (mapped to `question`) | The AI-generated/imported question text |
-| Edited Question | `editedQuestion` | `question` | User-curated question (takes precedence if set) |
-| Answer | `answer` | `answer` | The curated answer text |
-| Multi-turn History | `history` | `history` | Array of `{role, msg, refs}` for conversation turns |
-
-**Effective Question Logic:**
-- Backend: `synthQuestion` is the original; `editedQuestion` is the user's edited version
-- Frontend: Uses `editedQuestion || synthQuestion` as `question`
-- For duplicate detection: Compare `editedQuestion || synthQuestion` between items
-
-**Fingerprint/Signature Logic:** [frontend/src/hooks/useGroundTruth.ts](frontend/src/hooks/useGroundTruth.ts#L113-L135)
-
-The `stateSignature` function shows what fields define item identity:
-```typescript
-function stateSignature(it: GroundTruthItem): string {
-    return JSON.stringify({
-        id: it.id,
-        question: (it.question || "").trim(),
-        answer: (it.answer || "").trim(),
-        history: it.history || [],
-        references: refs,  // sorted by id
-        manualTags: [...(it.manualTags || [])].sort(),
-        status: it.status,
-        deleted: !!it.deleted,
-    });
-}
-```
-
-**Recommended Comparison Fields for Duplicate Detection:**
-1. **Question text** (normalized): `(editedQuestion || synthQuestion).trim().toLowerCase()`
-2. **Answer text** (normalized): `answer.trim().toLowerCase()`
-3. **Multi-turn content**: Concatenated `history[*].msg` for all turns
-
----
-
-### 3. Existing Duplicate Detection Logic
-
-**Finding: NO existing duplicate detection logic exists.**
-
-Grep search for `duplicate|similarity|compare` found:
-- References to Jira tickets requesting the feature (SA-534, SA-535)
-- Tag registry duplicate key prevention (unrelated)
-- Reference deduplication within a single item (not cross-item)
-
-**Existing Validation Service:** [backend/app/services/validation_service.py](backend/app/services/validation_service.py)
-
-Current validation only checks:
-- Manual tag values against the tag registry
-- No duplicate item detection
-
-**Jira Context:**
-- **SA-534:** "GTC: Duplicate Detection and Prevention for Drafts" (Spike, MVP label)
-- **SA-535:** "GTC: One time pass duplicate removal from drafts/approved"
-
-Both tickets indicate the requirement: *"As an SME I want to avoid working on draft items that are duplicates of approved items."*
-
----
-
-### 4. Import/Creation Flow for Draft Items
-
-**Bulk Import Endpoint:** [backend/app/api/v1/ground_truths.py](backend/app/api/v1/ground_truths.py#L54-L114)
-
-```python
-@router.post("", response_model=ImportBulkResponse)
-async def import_bulk(
-    items: list[GroundTruthItem],
-    buckets: int | None = Query(default=None),
-    approve: bool = Query(default=False),
-) -> ImportBulkResponse:
-```
-
-**Current Import Flow:**
-1. Items received via POST `/v1/ground-truths`
-2. Generate IDs for items without one (randomname)
-3. Validate items via `validate_bulk_items()` (tags only)
-4. Optionally set approval metadata if `approve=true`
-5. Apply computed tags
-6. Persist via `container.repo.import_bulk_gt()`
-
-**Insertion Point for Duplicate Detection:**
-- After step 2 (ID generation), before step 6 (persistence)
-- Or as a pre-import validation step
-
-**Single Item Assignment:** [backend/app/services/assignment_service.py](backend/app/services/assignment_service.py#L175-L220)
-
-When an SME assigns an item to themselves:
-1. Fetch the item
-2. Validate item can be assigned (not assigned to another user in draft)
-3. Set `status = draft`, `assignedTo = user`
-4. Create assignment document
-
-**Insertion Point:** Before or after step 3, check for duplicates against approved items.
-
----
-
-### 5. Warning/Notification Patterns in UI
-
-**Toast System:** [frontend/src/hooks/useToasts.ts](frontend/src/hooks/useToasts.ts)
-
-```typescript
-export type Toast = {
-    id: string;
-    kind: "success" | "error" | "info";
-    msg: string;
-    actionLabel?: string;
-    onAction?: () => void;
-};
-
-export function useToasts() {
-    // showToast(kind, msg, opts)
-    // opts: { duration, actionLabel, onAction }
-}
-```
-
-**Toast Component:** [frontend/src/components/common/Toasts.tsx](frontend/src/components/common/Toasts.tsx)
-
-- Displays in bottom-right corner
-- Color-coded by kind (success=emerald, error=rose, info=violet)
-- Supports action buttons for interactive toasts
-
-**Usage Pattern for Warnings:**
-```typescript
-showToast("info", "This draft may duplicate an approved item", {
-    duration: 8000,
-    actionLabel: "View Similar",
-    onAction: () => openSimilarItemsModal()
-});
-```
-
-**Alert Icon Component:** [frontend/src/components/app/QueueSidebar.tsx](frontend/src/components/app/QueueSidebar.tsx#L181)
-
-Uses `CircleAlert` from lucide-react for inline warnings:
-```tsx
-<CircleAlert className="h-3.5 w-3.5" /> unsaved
-```
-
----
-
-## Implementation Recommendations
-
-### Backend Duplicate Detection Service
-
-Create `backend/app/services/duplicate_detection_service.py`:
-
-```python
-class DuplicateDetectionService:
-    async def find_similar_approved(
-        self, 
-        item: GroundTruthItem,
-        threshold: float = 0.9
-    ) -> list[GroundTruthItem]:
-        """Find approved items similar to the given draft item."""
-        pass
-    
-    async def check_bulk_for_duplicates(
-        self, 
-        items: list[GroundTruthItem]
-    ) -> dict[str, list[str]]:
-        """Check a batch of items for duplicates. Returns {item_id: [similar_ids]}."""
-        pass
-```
-
-### Comparison Strategies
-
-1. **Exact Match:** Normalize and compare question text directly
-2. **Fuzzy Match:** Use Levenshtein distance or similar
-3. **Semantic Match:** Embed questions and use cosine similarity (future)
-
-### API Response Extension
-
-Extend `ImportBulkResponse` to include warnings:
-
-```python
-class ImportBulkResponse(BaseModel):
-    imported: int
-    errors: list[str]
-    uuids: list[str]
-    warnings: list[DuplicateWarning] = []  # NEW
-
-class DuplicateWarning(BaseModel):
-    draft_id: str
-    similar_approved_ids: list[str]
-    similarity_score: float
-```
-
-### Frontend Integration
-
-1. **On Import:** Show summary of potential duplicates
-2. **On Assignment:** Toast warning if assigned item resembles approved
-3. **In Editor:** Badge or inline warning in sidebar for flagged items
-
----
-
-## Summary
-
-| Question | Finding |
-|----------|---------|
-| Data model for draft/approved? | `GroundTruthStatus` enum with `draft`, `approved`, `deleted`, `skipped` |
-| Fields for comparison? | `synthQuestion`, `editedQuestion`, `answer`, `history[*].msg` |
-| Existing duplicate detection? | **None** - feature is requested in Jira (SA-534, SA-535) |
-| Import/creation flow? | Bulk import via POST `/v1/ground-truths`; single assign via assignment service |
-| UI warning patterns? | Toast system with `success/error/info` kinds; `CircleAlert` icon for inline warnings |
-
----
-
-## Files Referenced
-
-- [backend/app/domain/models.py](backend/app/domain/models.py) - GroundTruthItem model
-- [backend/app/domain/enums.py](backend/app/domain/enums.py) - GroundTruthStatus enum
-- [backend/app/api/v1/ground_truths.py](backend/app/api/v1/ground_truths.py) - Import endpoint
-- [backend/app/services/validation_service.py](backend/app/services/validation_service.py) - Current validation
-- [backend/app/services/assignment_service.py](backend/app/services/assignment_service.py) - Assignment flow
-- [frontend/src/models/groundTruth.ts](frontend/src/models/groundTruth.ts) - Frontend model
-- [frontend/src/hooks/useGroundTruth.ts](frontend/src/hooks/useGroundTruth.ts) - State signature logic
-- [frontend/src/hooks/useToasts.ts](frontend/src/hooks/useToasts.ts) - Toast system
-- [frontend/src/components/common/Toasts.tsx](frontend/src/components/common/Toasts.tsx) - Toast component
diff --git a/.copilot-tracking/subagent/20260122/explorer-sorting-research.md b/.copilot-tracking/subagent/20260122/explorer-sorting-research.md
deleted file mode 100644
index 06227bb..0000000
--- a/.copilot-tracking/subagent/20260122/explorer-sorting-research.md
+++ /dev/null
@@ -1,257 +0,0 @@
-# Explorer Sorting System Research
-
-## Context
-
-Research into how the Explorer component implements column sorting, sort state management, visual indicators, and backend integration.
-
-## Sources Consulted
-
-### Codebase
-
-- [frontend/src/components/app/QuestionsExplorer.tsx](frontend/src/components/app/QuestionsExplorer.tsx): Main Explorer component with sorting logic
-- [frontend/src/services/groundTruths.ts](frontend/src/services/groundTruths.ts): API service with sort parameter handling
-- [backend/app/domain/enums.py](backend/app/domain/enums.py): `SortField` and `SortOrder` enum definitions
-- [backend/app/api/v1/ground_truths.py](backend/app/api/v1/ground_truths.py): API endpoint accepting sort parameters
-- [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py): Cosmos DB ORDER BY implementation
-
----
-
-## 1. Current Column Sorting Implementation
-
-### Frontend Sort State
-
-The Explorer manages sort state with two pieces of React state:
-
-```typescript
-type SortColumn = "refs" | "reviewedAt" | "hasAnswer" | null;
-type SortDirection = "asc" | "desc";
-
-const [sortColumn, setSortColumn] = useState<SortColumn>(null);
-const [sortDirection, setSortDirection] = useState<SortDirection>("desc");
-```
-
-### Sort Handler Logic
-
-The `handleSort` function implements a three-state toggle:
-
-1. **First click**: Set column, direction = `desc`
-2. **Second click (same column)**: Toggle direction to `asc`
-3. **Third click (same column)**: Clear sort (column = `null`, direction = `desc`)
-
-```typescript
-const handleSort = (column: "refs" | "reviewedAt" | "hasAnswer") => {
-  if (sortColumn === column) {
-    if (sortDirection === "desc") {
-      setSortDirection("asc");
-    } else {
-      setSortColumn(null);
-      setSortDirection("desc");
-    }
-  } else {
-    setSortColumn(column);
-    setSortDirection("desc");
-  }
-};
-```
-
----
-
-## 2. Available Sorting Options
-
-### Frontend Sortable Columns
-
-| Column | UI Label | API Parameter |
-|--------|----------|---------------|
-| `refs` | Refs | `totalReferences` |
-| `reviewedAt` | Reviewed | `reviewedAt` |
-| `hasAnswer` | Answer? | `hasAnswer` |
-
-### Backend SortField Enum
-
-```python
-class SortField(str, Enum):
-    reviewed_at = "reviewedAt"
-    updated_at = "updatedAt"
-    id = "id"
-    has_answer = "hasAnswer"
-    totalReferences = "totalReferences"
-```
-
-### Backend SortOrder Enum
-
-```python
-class SortOrder(str, Enum):
-    asc = "asc"
-    desc = "desc"
-```
-
-### Default Sort
-
-- **Backend default**: `reviewedAt DESC`
-- **Frontend default**: No sort applied (column = `null`)
-
----
-
-## 3. Visual Sort Indicator Implementation
-
-### Indicator Design
-
-Sort indicators use arrow symbols displayed inline with column headers:
-
-- **Descending**: `↓`
-- **Ascending**: `↑`
-
-### Two-State Visual System
-
-The Explorer shows two distinct indicator states:
-
-1. **Applied filter (violet)**: Shows the sort currently active in the backend response
-2. **Pending filter (amber, 50% opacity)**: Shows a selected but unapplied sort
-
-```tsx
-{appliedFilter.sortColumn === "refs" && (
-  <span className="text-violet-600">
-    {appliedFilter.sortDirection === "desc" ? "↓" : "↑"}
-  </span>
-)}
-{sortColumn === "refs" && sortColumn !== appliedFilter.sortColumn && (
-  <span className="text-amber-500 opacity-50">
-    {sortDirection === "desc" ? "↓" : "↑"}
-  </span>
-)}
-```
-
-### Known Issue
-
-SA-361 reports that the ascending sort visual indicator does not update correctly for the Answer column. The code structure appears correct, so the bug may be in the conditional rendering logic or state synchronization.
-
----
-
-## 4. Tag Count as a Sortable Field
-
-### Current Status
-
-**Tag count is NOT currently a sortable field.**
-
-### Backlog Item
-
-SA-684 requests this feature:
-
-> "GTC: Ability to sort by tag number effectively"
->
-> As a GTC user, I would like to be able to sort by tags descending to find ground truths that have fewer tags than expected to be able to find items needing review.
-
-### Implementation Requirements
-
-To add tag count sorting:
-
-#### Backend Changes
-
-1. Add `tagCount` to `SortField` enum:
-   ```python
-   class SortField(str, Enum):
-       # existing...
-       tag_count = "tagCount"
-   ```
-
-2. Add computed field or stored property `tagCount` to documents (similar to `totalReferences` backfill pattern)
-
-3. Add field mapping in `_build_secure_sort_clause`:
-   ```python
-   secure_field_map = {
-       # existing...
-       SortField.tag_count: "c.tagCount",
-   }
-   ```
-
-4. Create backfill script (follow `backfill_total_references.py` pattern)
-
-5. Update Cosmos DB indexing policy to include `tagCount`
-
-#### Frontend Changes
-
-1. Add `"tagCount"` to `SortColumn` type
-2. Add sortable column header in table
-3. Map frontend column name to API parameter
-
----
-
-## 5. Sorting Passed to Backend API
-
-### Frontend Service Call
-
-The Explorer builds API parameters from applied filter state:
-
-```typescript
-const sortByParam =
-  appliedFilter.sortColumn === "refs"
-    ? "totalReferences"
-    : appliedFilter.sortColumn;
-
-const params = {
-  // ...filters
-  sortBy: sortByParam,
-  sortOrder: sortByParam ? appliedFilter.sortDirection : undefined,
-  page: safePage,
-  limit: itemsPerPage,
-};
-
-listAllGroundTruths(params);
-```
-
-### Service Layer
-
-`groundTruths.ts` passes parameters to the generated API client:
-
-```typescript
-if (params.sortBy)
-  query.sortBy = params.sortBy as components["schemas"]["SortField"];
-if (params.sortOrder) query.sortOrder = params.sortOrder;
-```
-
-### API Endpoint
-
-`GET /v1/ground-truths` accepts query parameters:
-
-```python
-sort_by: SortField = Query(default=SortField.reviewed_at.value, alias="sortBy"),
-sort_order: SortOrder = Query(default=SortOrder.desc.value, alias="sortOrder"),
-```
-
-### Cosmos DB Query
-
-The repository builds a secure ORDER BY clause:
-
-```python
-def _build_secure_sort_clause(self, sort_field: SortField, sort_direction: SortOrder) -> str:
-    secure_field_map = {
-        SortField.id: "c.id",
-        SortField.updated_at: "c.updatedAt",
-        SortField.reviewed_at: "c.reviewedAt",
-        SortField.has_answer: "c.reviewedAt",
-        SortField.totalReferences: "c.totalReferences",
-    }
-    # ...builds "ORDER BY c.field ASC/DESC"
-```
-
-A secondary sort by `c.id ASC` is added for stable pagination when the primary sort field is not `id`.
-
----
-
-## Key Findings Summary
-
-| Question | Answer |
-|----------|--------|
-| How is sorting implemented? | React state (`sortColumn`, `sortDirection`) with three-state toggle handler |
-| Available sort options? | `refs` (totalReferences), `reviewedAt`, `hasAnswer` |
-| Sort direction indicator? | Arrow symbols (↓/↑), violet=applied, amber=pending |
-| Is tag count sortable? | No - requested in SA-684, not yet implemented |
-| How is sort passed to API? | `sortBy` and `sortOrder` query params to `GET /v1/ground-truths` |
-
----
-
-## Recommendations
-
-1. **SA-361 bug fix**: Investigate why ascending sort visual for Answer column doesn't update
-2. **SA-684 implementation**: Follow `totalReferences` pattern for computed `tagCount` field
-3. **Consider**: Adding `updatedAt` as a frontend sortable column (already supported by backend)
diff --git a/.copilot-tracking/subagent/20260122/explorer-state-preservation-research.md b/.copilot-tracking/subagent/20260122/explorer-state-preservation-research.md
deleted file mode 100644
index 7857b45..0000000
--- a/.copilot-tracking/subagent/20260122/explorer-state-preservation-research.md
+++ /dev/null
@@ -1,203 +0,0 @@
-# Explorer State Preservation Research
-
-**Research Date:** 2026-01-22
-**Related Issue:** SA-364 - GTC Explorer: Assign from explorer switches to curation view, losing filters
-
----
-
-## 1. Current Explorer Component Structure
-
-### Primary Component
-- **File:** [frontend/src/components/app/QuestionsExplorer.tsx](../../../frontend/src/components/app/QuestionsExplorer.tsx)
-- **Type:** Functional component with internal state management
-- **Purpose:** Displays ground truth items in a filterable, sortable table with actions (Assign, Inspect, Delete)
-
-### Component Hierarchy
-```
-App.tsx
-└── GTAppDemo (demo.tsx)
-    ├── AppHeader
-    ├── QuestionsExplorer (viewMode === "questions")
-    ├── CuratePane (viewMode === "curate")
-    └── StatsPage (viewMode === "stats")
-```
-
-### Key Interfaces
-
-```typescript
-interface FilterState {
-  status: FilterType;      // "all" | "draft" | "approved" | "skipped" | "deleted"
-  dataset: string;         // dataset name or "all"
-  tags: string[];          // array of selected tags (AND logic)
-  itemId: string;          // item ID filter text
-  refUrl: string;          // reference URL filter text
-  sortColumn: SortColumn;  // "refs" | "reviewedAt" | "hasAnswer" | null
-  sortDirection: SortDirection; // "asc" | "desc"
-}
-```
-
----
-
-## 2. Filter State Management Analysis
-
-### Current Implementation: Local Component State
-
-All filter state is managed via `useState` hooks **inside** `QuestionsExplorer`:
-
-```typescript
-// Filter state (unapplied - UI inputs)
-const [activeFilter, setActiveFilter] = useState<FilterType>("all");
-const [selectedDataset, setSelectedDataset] = useState<string>("all");
-const [selectedTags, setSelectedTags] = useState<string[]>([]);
-const [itemIdFilter, setItemIdFilter] = useState<string>("");
-const [referenceUrlFilter, setReferenceUrlFilter] = useState<string>("");
-const [sortColumn, setSortColumn] = useState<SortColumn>(null);
-const [sortDirection, setSortDirection] = useState<SortDirection>("desc");
-const [itemsPerPage, setItemsPerPage] = useState(25);
-
-// Applied filter state (what was last sent to backend)
-const [appliedFilter, setAppliedFilter] = useState<FilterState>({...});
-const [currentPage, setCurrentPage] = useState(1);
-```
-
-### Two-Phase Filter Pattern
-1. **Unapplied state:** User modifies filters in UI
-2. **Applied state:** User clicks "Apply Filters" button to execute query
-3. `hasUnappliedChanges` computed via `useMemo` to track dirty state
-
-### Problems with Current Approach
-- **No state lifting:** Filter state is entirely local to `QuestionsExplorer`
-- **No persistence:** When component unmounts (view switch), all state is lost
-- **No URL sync:** Filters are not reflected in URL params
-- **No context provider:** No shared state mechanism across views
-
----
-
-## 3. Navigation Actions That Cause State Loss
-
-### Identified Navigation Triggers
-
-| Action | Code Location | Effect |
-|--------|--------------|--------|
-| **Assign button** | `demo.tsx:207-229` | Calls `assignItem()`, then `setViewMode("curate")` |
-| **Header toggle** | `AppHeader.tsx:48-56` | Toggles between "curate" and "questions" |
-| **Stats button** | `AppHeader.tsx:57-64` | Sets `viewMode` to "stats" |
-
-### Critical Code Path (Assign Action)
-
-```typescript
-// demo.tsx lines 207-229
-onAssign={async (item) => {
-  // ...validation...
-  await assignItem(item.datasetName, item.bucket, item.id);
-  await gt.refreshList();
-  await gt.selectItem(item.id);
-  setViewMode("curate");  // <-- CAUSES UNMOUNT OF QuestionsExplorer
-  toast("success", `Assigned ${item.id} for curation`);
-}}
-```
-
-**Root cause:** `setViewMode("curate")` triggers React to unmount `QuestionsExplorer` and mount `CuratePane`, destroying all local filter state.
-
----
-
-## 4. State Persistence Mechanisms
-
-### Current State: **None implemented**
-
-#### localStorage
-- **Usage:** Commented out in `CuratePane.tsx` (line 163)
-- **Status:** Not active for any feature
-
-#### URL State / Query Parameters
-- **Routing library:** **None** - app uses simple `viewMode` state switching
-- **URL params:** Not used for any state persistence
-- **`package.json`:** No `react-router`, `@tanstack/react-router`, or similar
-
-#### Context API
-- **Existing contexts:** None for filter/view state
-- **Pattern:** App uses prop drilling from `GTAppDemo` to children
-
-#### Session/Browser APIs
-- `sessionStorage`: Not used
-- `history.pushState/replaceState`: Not used
-
----
-
-## 5. Routing Architecture
-
-### Current Implementation: **No Routing Library**
-
-The application uses a simple state-based view switching pattern:
-
-```typescript
-// demo.tsx
-const [viewMode, setViewMode] = useState<"curate" | "questions" | "stats">("curate");
-
-// Conditional rendering
-{viewMode === "stats" && <StatsPage ... />}
-{viewMode === "questions" && <QuestionsExplorer ... />}
-{viewMode === "curate" && <CuratePane ... />}
-```
-
-### Implications
-- No URL-based navigation
-- No browser back/forward support
-- No deep linking capability
-- No route-based code splitting
-
----
-
-## 6. Summary & Recommendations
-
-### Key Findings
-
-| Finding | Status | Impact |
-|---------|--------|--------|
-| Filter state is local to component | ✅ Confirmed | State lost on unmount |
-| No routing library | ✅ Confirmed | No URL-based persistence |
-| No localStorage usage | ✅ Confirmed | No browser persistence |
-| No Context for filters | ✅ Confirmed | No cross-view sharing |
-| Assign triggers view switch | ✅ Confirmed | Direct cause of SA-364 |
-
-### Recommended Solutions (Priority Order)
-
-#### Option A: Lift State to Parent (Minimal Change)
-- Move `FilterState` to `GTAppDemo`
-- Pass as props to `QuestionsExplorer`
-- State survives view switches
-- **Effort:** Low | **Risk:** Low
-
-#### Option B: URL Query Parameters (Better UX)
-- Sync filter state to URL search params
-- Use `URLSearchParams` API directly (no router needed)
-- Enables deep linking and back/forward
-- **Effort:** Medium | **Risk:** Low
-
-#### Option C: Context + localStorage (Full Persistence)
-- Create `ExplorerFilterContext`
-- Persist to localStorage on change
-- Restore on mount
-- **Effort:** Medium | **Risk:** Low
-
-#### Option D: Add React Router (Future-Proof)
-- Integrate routing library
-- Route-based view switching
-- URL state via loader/search params
-- **Effort:** High | **Risk:** Medium
-
-### Alternative Quick Fix
-Per SA-364 proposed solution #1:
-> "Do not automatically switch to the curation view when making an assignment from the explorer"
-
-This would involve removing `setViewMode("curate")` from the assign handler, keeping the user in the Explorer after assignment. However, this may not match the desired UX if the user wants to curate the assigned item immediately.
-
----
-
-## Files Referenced
-
-- [frontend/src/demo.tsx](../../../frontend/src/demo.tsx) - Main app container
-- [frontend/src/components/app/QuestionsExplorer.tsx](../../../frontend/src/components/app/QuestionsExplorer.tsx) - Explorer component
-- [frontend/src/components/app/AppHeader.tsx](../../../frontend/src/components/app/AppHeader.tsx) - Navigation header
-- [frontend/package.json](../../../frontend/package.json) - Dependencies
-- [prd-refined-2.json](../../../prd-refined-2.json) - Issue SA-364 definition
diff --git a/.copilot-tracking/subagent/20260122/explorer-view-research.md b/.copilot-tracking/subagent/20260122/explorer-view-research.md
deleted file mode 100644
index 8a8d9d8..0000000
--- a/.copilot-tracking/subagent/20260122/explorer-view-research.md
+++ /dev/null
@@ -1,51 +0,0 @@
----
-topic: explorer-view
-jtbd: JTBD-001
-date: 2026-01-22
-status: complete
----
-
-# Research: Explorer View
-
-## Context
-
-The explorer view enables browsing and filtering ground-truth items outside the assigned queue, and initiating actions such as inspection, assignment, and deletion.
-
-## Sources Consulted
-
-### URLs
-- (None)
-
-### Codebase
-- [frontend/src/components/app/QuestionsExplorer.tsx](frontend/src/components/app/QuestionsExplorer.tsx): Implements an explorer UI with filtering (status/dataset/tags/itemId/refUrl), sorting, pagination, and item actions.
-- [frontend/src/services/groundTruths.ts](frontend/src/services/groundTruths.ts): Implements `listAllGroundTruths()` and maps API payloads into the frontend model.
-- [frontend/src/services/tags.ts](frontend/src/services/tags.ts): Fetches manual/computed tags and validates exclusive tag groups.
-
-### Documentation
-- [.copilot-tracking/research/20260121-high-level-requirements-research.md](.copilot-tracking/research/20260121-high-level-requirements-research.md): Provides repo-wide behavioral requirements context.
-
-## Key Findings
-
-1. The explorer supports server-backed listing (`GET /v1/ground-truths`) with query parameters for status, dataset, tags, itemId, refUrl, sorting, and pagination.
-2. The explorer fetches and displays available datasets and tags to drive filtering.
-3. The explorer UI includes a concept of “inspect” and “assign” actions per item, plus a delete action.
-4. The explorer assumes the backend provides pagination metadata when listing items.
-5. Existing documentation on searching/browsing has gaps, so the explorer implementation is the current source of truth.
-
-## Existing Patterns
-
-| Pattern | Location | Relevance |
-|---------|----------|-----------|
-| Filter state vs applied filter state | [frontend/src/components/app/QuestionsExplorer.tsx](frontend/src/components/app/QuestionsExplorer.tsx) | Implements explicit Apply behavior and avoids unnecessary calls |
-| Server-side sorting and pagination | [frontend/src/components/app/QuestionsExplorer.tsx](frontend/src/components/app/QuestionsExplorer.tsx) | Assumes backend performs sorting and returns pagination |
-| List API wrapper mapping wire schema to UI model | [frontend/src/services/groundTruths.ts](frontend/src/services/groundTruths.ts) | Defines frontend expectations for list payload shape |
-
-## Open Questions
-
-- (None)
-
-## Recommendations for Spec
-
-- Specify supported explorer filters and sorting fields as observable UI capabilities.
-- Specify that the list view uses server-backed pagination when available.
-- Specify that explorer actions (inspect/assign/delete) are initiated from the UI but depend on backend support.
diff --git a/.copilot-tracking/subagent/20260122/export-snapshots-research.md b/.copilot-tracking/subagent/20260122/export-snapshots-research.md
deleted file mode 100644
index 4554699..0000000
--- a/.copilot-tracking/subagent/20260122/export-snapshots-research.md
+++ /dev/null
@@ -1,50 +0,0 @@
----
-topic: export-snapshots
-jtbd: JTBD-001
-date: 2026-01-22
-status: complete
----
-
-# Research: Export Snapshots
-
-## Context
-
-The export system generates downloadable JSON snapshots of curated data in configurable formats.
-
-## Sources Consulted
-
-### URLs
-- (None)
-
-### Codebase
-- [frontend/src/services/groundTruths.ts](frontend/src/services/groundTruths.ts): Includes `downloadSnapshot` function that triggers a browser download of JSON.
-- [backend/app/services/snapshot_service.py](backend/app/services/snapshot_service.py): Implements snapshot export logic.
-
-### Documentation
-- [.copilot-tracking/research/20260121-high-level-requirements-research.md](.copilot-tracking/research/20260121-high-level-requirements-research.md): Consolidates export/snapshot requirements.
-- [backend/docs/export-pipeline.md](backend/docs/export-pipeline.md): Documents attachment and artifact export modes, defaults, and manifest requirements.
-
-## Key Findings
-
-1. The backend supports two export modes: `attachment` (single JSON file) and `artifact` (per-item JSON files + manifest).
-2. The snapshot download endpoint returns a JSON document for browser download with Content-Disposition header.
-3. Artifact exports include a manifest with a stable `schemaVersion` and snapshot metadata.
-4. Export processors run before formatting and may merge tag fields into a single `tags` array.
-5. The frontend triggers download via a service function that invokes the snapshot endpoint.
-
-## Existing Patterns
-
-| Pattern | Location | Relevance |
-|---------|----------|-----------|
-| Snapshot export endpoint with Content-Disposition | [backend/docs/export-pipeline.md](backend/docs/export-pipeline.md) | Defines wire behavior for download |
-| Manifest with schemaVersion | [backend/docs/export-pipeline.md](backend/docs/export-pipeline.md) | Defines contract for artifact mode |
-
-## Open Questions
-
-- (None)
-
-## Recommendations for Spec
-
-- Specify supported export modes (attachment, artifact) and their default behavior.
-- Specify that attachment mode returns a single JSON document with download headers.
-- Specify that artifact mode includes a manifest with `schemaVersion`.
diff --git a/.copilot-tracking/subagent/20260122/inspection-performance-research.md b/.copilot-tracking/subagent/20260122/inspection-performance-research.md
deleted file mode 100644
index 4df4f4e..0000000
--- a/.copilot-tracking/subagent/20260122/inspection-performance-research.md
+++ /dev/null
@@ -1,197 +0,0 @@
-# Inspection Performance Research
-
-**Date:** 2026-01-22
-**Topic:** Caching and memoization patterns for inspection modals
-
-## 1. InspectItemModal Implementation
-
-**Location:** [frontend/src/components/modals/InspectItemModal.tsx](frontend/src/components/modals/InspectItemModal.tsx)
-
-### Data Fetching Pattern
-
-The `InspectItemModal` component fetches complete item data on every open:
-
-```tsx
-// Lines 62-111
-useEffect(() => {
-  if (!isOpen || !item) {
-    setCompleteItem(null);
-    setLoadError(null);
-    return;
-  }
-
-  // Always fetch fresh data to ensure we get complete conversation history
-  setIsLoading(true);
-  setLoadError(null);
-
-  (async () => {
-    const completeItemData = await getGroundTruth(
-      item.datasetName || "",
-      item.bucket || "",
-      item.id,
-    );
-    setCompleteItem(completeItemData);
-  })()
-}, [isOpen, item]);
-```
-
-### Data Fetched
-
-- Complete `GroundTruthItem` via `getGroundTruth()` API call
-- Runtime configuration for trusted reference domains
-- Uses `MultiTurnEditor` component in read-only mode to display conversation
-
-### Performance Issue
-
-**No caching of previously viewed items.** Each time the modal opens for the same item, a fresh API call is made. The comment explicitly states "Always fetch fresh data" but this is unnecessary for recently viewed items in a read-only context.
-
-## 2. TurnReferencesModal Implementation
-
-**Location:** [frontend/src/components/app/editor/TurnReferencesModal.tsx](frontend/src/components/app/editor/TurnReferencesModal.tsx)
-
-### References Computation
-
-The modal filters references for a specific turn on every render:
-
-```tsx
-// Line 88 - computed on every render
-const turnRefs = references.filter((r) => r.messageIndex === messageIndex);
-```
-
-Additional computed values on every render:
-```tsx
-// Line 91 - set computed on every render
-const urlsInTurn = new Set(turnRefs.map((r) => normalizeUrl(r.url)));
-```
-
-### Performance Issue
-
-**No memoization for references filtering.** The `turnRefs` filter and `urlsInTurn` Set are recomputed on every render, even when `references` and `messageIndex` haven't changed.
-
-## 3. Existing Caching Patterns
-
-### Service-Level Caching
-
-| Service | Caching Pattern | TTL |
-|---------|-----------------|-----|
-| [datasets.ts](frontend/src/services/datasets.ts) | In-memory cache with TTL | 5 minutes |
-| [runtimeConfig.ts](frontend/src/services/runtimeConfig.ts) | Single-fetch cache (permanent) | Forever |
-
-**datasets.ts example:**
-```typescript
-const CACHE_TTL_MS = 5 * 60 * 1000; // 5 minutes
-let datasetsCache: { data: string[] | null; timestamp: number } = {
-  data: null,
-  timestamp: 0,
-};
-```
-
-**runtimeConfig.ts example:**
-```typescript
-let cachedConfig: RuntimeConfig | null = null;
-let configPromise: Promise<RuntimeConfig> | null = null;
-
-export async function getRuntimeConfig(): Promise<RuntimeConfig> {
-  if (cachedConfig) return cachedConfig;
-  if (configPromise) return configPromise;
-  // ... fetch and cache
-}
-```
-
-### No Ground Truth Item Caching
-
-The `groundTruths.ts` service has **no caching mechanism** for individual items. Each `getGroundTruth()` call makes a fresh API request.
-
-## 4. React Query / Data Fetching Library Status
-
-**React Query is NOT currently in use.**
-
-The `package.json` shows no `@tanstack/react-query` or `react-query` dependency:
-
-```json
-"dependencies": {
-  "@microsoft/applicationinsights-web": "^3.0.4",
-  "openapi-fetch": "^0.9.8",
-  "react": "^19.1.1",
-  // ... no react-query
-}
-```
-
-The React Query reference in [connecting-e2e-best-practices.md](frontend/docs/connecting-e2e-best-practices.md) is guidance only, not an actual implementation.
-
-**Current data fetching approach:**
-- Direct `fetch()` calls via `openapi-fetch` client
-- Manual state management with `useState`/`useEffect`
-- No automatic caching, deduplication, or stale-while-revalidate patterns
-
-## 5. Existing Memoization Patterns
-
-### useCallback Usage
-
-Found in multiple hooks:
-
-| File | Pattern |
-|------|---------|
-| [useReferencesSearch.ts](frontend/src/hooks/useReferencesSearch.ts) | `runSearch`, `clearResults` wrapped in `useCallback` |
-| [useTags.ts](frontend/src/hooks/useTags.ts) | `refresh`, `ensureTag`, `filter` wrapped in `useCallback` |
-| [useToasts.ts](frontend/src/hooks/useToasts.ts) | `dismiss`, `clear`, `showToast` wrapped in `useCallback` |
-| [useGroundTruth.ts](frontend/src/hooks/useGroundTruth.ts) | Extensive `useCallback` usage for all actions |
-
-### useMemo Usage
-
-Found in components:
-
-| File | Pattern |
-|------|---------|
-| [QueueSidebar.tsx](frontend/src/components/app/QueueSidebar.tsx) | `ids` memoized |
-| [QuestionsExplorer.tsx](frontend/src/components/app/QuestionsExplorer.tsx) | `hasUnappliedChanges`, `displayItems` memoized |
-| [TagsEditor.tsx](frontend/src/components/app/editor/TagsEditor.tsx) | `suggestions` memoized |
-| [InstructionsPane.tsx](frontend/src/components/app/InstructionsPane.tsx) | Memoization used |
-| [useGroundTruth.ts](frontend/src/hooks/useGroundTruth.ts) | `qaChanged`, `canApprove`, `hasUnsaved` memoized |
-
-### Gaps in Memoization
-
-- **InspectItemModal:** No `useMemo` or `useCallback` hooks used
-- **TurnReferencesModal:** No `useMemo` for `turnRefs` or `urlsInTurn` computations
-
-## 6. Recommendations
-
-### Immediate Optimizations
-
-1. **Add item cache to InspectItemModal:**
-   - Implement LRU cache for recently viewed items
-   - Cache key: `${datasetName}:${bucket}:${id}`
-   - Suggested TTL: 2-5 minutes or LRU with 10-20 item limit
-
-2. **Memoize TurnReferencesModal computations:**
-   ```tsx
-   const turnRefs = useMemo(
-     () => references.filter((r) => r.messageIndex === messageIndex),
-     [references, messageIndex]
-   );
-   
-   const urlsInTurn = useMemo(
-     () => new Set(turnRefs.map((r) => normalizeUrl(r.url))),
-     [turnRefs]
-   );
-   ```
-
-### Medium-Term Improvements
-
-3. **Service-level item caching:**
-   - Add caching to `groundTruths.ts` similar to `datasets.ts` pattern
-   - Consider cache invalidation on save operations
-
-4. **Consider React Query adoption:**
-   - Provides automatic caching, deduplication, background refetch
-   - Simpler code for cache management
-   - Already documented in best practices
-
-## Summary
-
-| Component | Issue | Severity |
-|-----------|-------|----------|
-| InspectItemModal | No item caching - fetches on every open | High |
-| TurnReferencesModal | No memoization for references filter | Medium |
-| groundTruths.ts | No service-level item cache | Medium |
-| Overall | No React Query adoption | Low |
diff --git a/.copilot-tracking/subagent/20260122/keyword-search-research.md b/.copilot-tracking/subagent/20260122/keyword-search-research.md
deleted file mode 100644
index 6e2fc0b..0000000
--- a/.copilot-tracking/subagent/20260122/keyword-search-research.md
+++ /dev/null
@@ -1,162 +0,0 @@
-# Keyword Search Research
-
-## Research Questions Answered
-
-### 1. How does the Explorer currently fetch and display ground truth items?
-
-The Explorer component ([frontend/src/components/app/QuestionsExplorer.tsx](../../../frontend/src/components/app/QuestionsExplorer.tsx)) fetches data via `listAllGroundTruths()` from the groundTruths service. Key behaviors:
-
-- **Server-side pagination and filtering**: Uses `GET /v1/ground-truths` with query parameters
-- **Filter state vs applied state**: Separates filter UI state from applied/committed filters to batch changes
-- **Explicit Apply button**: Users must click "Apply Filters" to send filter changes to backend
-- **Parameters supported**: `status`, `dataset`, `tags`, `itemId`, `refUrl`, `sortBy`, `sortOrder`, `page`, `limit`
-
-### 2. What data fields exist on ground truth items that would need to be searched?
-
-From [frontend/src/models/groundTruth.ts](../../../frontend/src/models/groundTruth.ts) and [backend/app/domain/models.py](../../../backend/app/domain/models.py):
-
-**Primary text fields for keyword search:**
-| Field | Type | Description |
-|-------|------|-------------|
-| `question` | string | The question text (derived from `synthQuestion` or `editedQuestion`) |
-| `answer` | string | The answer text |
-| `history` | ConversationTurn[] | Multi-turn conversation history |
-| `history[].content` (msg) | string | Individual turn content (user or agent) |
-| `comment` | string | Free-form curator notes |
-
-**History/ConversationTurn structure (multi-turn):**
-```typescript
-type ConversationTurn = {
-  role: "user" | "agent";
-  content: string;
-  expectedBehavior?: ExpectedBehavior[];
-};
-```
-
-**Backend HistoryItem model:**
-```python
-class HistoryItem(BaseModel):
-    role: HistoryItemRole  # User or Assistant
-    msg: str
-    refs: Optional[list[Reference]] = None
-    expected_behavior: Optional[list[ExpectedBehavior]]
-```
-
-### 3. Is there any existing search functionality in the frontend or backend?
-
-**Backend search service exists but serves a different purpose:**
-
-- **File:** [backend/app/api/v1/search.py](../../../backend/app/api/v1/search.py)
-- **Endpoint:** `GET /v1/search?q=<query>&top=<limit>`
-- **Purpose:** Queries an external AI Search index for reference documents (not ground truth items)
-- **Implementation:** Delegates to `SearchService.query()` which uses a `SearchAdapter` for external search backends
-
-**Current filtering in Explorer (not keyword search):**
-- `itemId`: Case-sensitive partial match on item ID
-- `refUrl`: Case-sensitive partial match on reference URLs (item-level and history-level)
-- `tags`: Filter by manual/computed tags (AND logic)
-- `status`, `dataset`: Exact match filters
-
-**No existing keyword search for question/answer/history text content.**
-
-### 4. What API endpoints does the Explorer use to fetch items?
-
-| Endpoint | Method | Purpose |
-|----------|--------|---------|
-| `GET /v1/ground-truths` | GET | List/filter ground truths with pagination |
-| `GET /v1/ground-truths/{datasetName}/{bucket}/{item_id}` | GET | Get single item by ID |
-| `PUT /v1/ground-truths/{datasetName}/{bucket}/{item_id}` | PUT | Update item |
-| `DELETE /v1/ground-truths/{datasetName}/{bucket}/{item_id}` | DELETE | Soft-delete item |
-
-**List endpoint query parameters:**
-- `status`: Filter by status (draft, approved, skipped, deleted)
-- `dataset`: Filter by dataset name
-- `tags`: Comma-separated list of tags (AND logic)
-- `itemId`: Partial ID match
-- `refUrl`: Partial reference URL match
-- `sortBy`: Sort field (reviewedAt, totalReferences, hasAnswer)
-- `sortOrder`: asc or desc
-- `page`, `limit`: Pagination
-
-### 5. How is the data structured in Cosmos DB and what indexes exist?
-
-**Container structure:**
-- Uses MultiHash partition key: `[/datasetName, /bucket]`
-- Ground truth items have `docType: "ground-truth-item"`
-
-**Indexing policy** from [backend/scripts/indexing-policy.json](../../../backend/scripts/indexing-policy.json):
-
-```json
-{
-  "indexingMode": "consistent",
-  "automatic": true,
-  "includedPaths": [{ "path": "/*" }],
-  "excludedPaths": [{ "path": "/\"_etag\"/?" }],
-  "compositeIndexes": [
-    // For sorting by reviewedAt, updatedAt, totalReferences
-    [{"path": "/reviewedAt", "order": "descending"}, {"path": "/id", "order": "ascending"}],
-    [{"path": "/totalReferences", "order": "descending"}, {"path": "/id", "order": "ascending"}],
-    // ... more composite indexes for combined status + sort scenarios
-  ],
-  "fullTextIndexes": []  // Currently empty - no full-text search indexes
-}
-```
-
-**Key finding:** The `fullTextIndexes` array is empty. Cosmos DB supports full-text search via the `FullTextContains()` function, but only once full-text indexing is explicitly configured.
-
----
-
-## Key Findings Summary
-
-### Current State
-1. **No keyword search exists** for searching text content in questions, answers, or multi-turn history
-2. Explorer supports filtering by ID, URL, tags, status, and dataset - but not text content
-3. An external search service exists but searches reference documents, not ground truth items
-4. Cosmos DB full-text indexing is not currently configured
-
-### Fields to Search
-For comprehensive keyword search across all conversation text:
-- `synthQuestion` / `editedQuestion` (question text)
-- `answer`
-- `history[*].msg` (all turn content - both user and agent messages)
-- Optionally: `comment` (curator notes)
-
-### Implementation Considerations
-
-**Option A: In-memory filtering (simple, limited scale)**
-- Fetch all items matching other filters, filter in memory
-- Pros: No infrastructure changes
-- Cons: Poor performance with large datasets, RU cost for fetching all items
-
-**Option B: Cosmos DB full-text search**
-- Add full-text indexes to indexing policy
-- Use `FullTextContains()` or `FullTextScore()` in queries
-- Pros: Native Cosmos support, server-side filtering
-- Cons: Requires index configuration, may not work with Cosmos emulator
-
-**Option C: Azure AI Search integration**
-- Index ground truth items in Azure AI Search
-- Leverage existing `SearchService` pattern
-- Pros: Advanced search capabilities, ranking
-- Cons: Additional infrastructure, sync complexity
-
-### Recommended Next Steps
-1. Determine scale requirements (how many items, how often searched)
-2. Decide on search scope (question only vs all text fields vs multi-turn history)
-3. Evaluate Cosmos DB full-text search feasibility (emulator compatibility)
-4. Design API contract for keyword search parameter
-
----
-
-## Sources Consulted
-
-### Codebase Files
-- [frontend/src/components/app/QuestionsExplorer.tsx](../../../frontend/src/components/app/QuestionsExplorer.tsx) - Explorer component implementation
-- [frontend/src/models/groundTruth.ts](../../../frontend/src/models/groundTruth.ts) - Frontend data model
-- [frontend/src/services/groundTruths.ts](../../../frontend/src/services/groundTruths.ts) - API service layer
-- [backend/app/api/v1/ground_truths.py](../../../backend/app/api/v1/ground_truths.py) - Ground truths API endpoints
-- [backend/app/api/v1/search.py](../../../backend/app/api/v1/search.py) - Existing search endpoint
-- [backend/app/domain/models.py](../../../backend/app/domain/models.py) - Backend data models
-- [backend/app/services/search_service.py](../../../backend/app/services/search_service.py) - Search service implementation
-- [backend/app/adapters/repos/cosmos_repo.py](../../../backend/app/adapters/repos/cosmos_repo.py) - Cosmos DB repository
-- [backend/scripts/indexing-policy.json](../../../backend/scripts/indexing-policy.json) - Cosmos DB index configuration
diff --git a/.copilot-tracking/subagent/20260122/modal-keyboard-handling-research.md b/.copilot-tracking/subagent/20260122/modal-keyboard-handling-research.md
deleted file mode 100644
index 3cbca53..0000000
--- a/.copilot-tracking/subagent/20260122/modal-keyboard-handling-research.md
+++ /dev/null
@@ -1,201 +0,0 @@
-# Modal Keyboard Handling Research
-
-**Date:** 2026-01-22  
-**Topic:** modal-keyboard-handling  
-**Status:** Complete
-
-## Key Findings Summary
-
-| Question | Finding |
-|----------|---------|
-| Modal/dialog library | Custom implementation using React Portals (`createPortal`) |
-| TurnReferencesModal location | [TurnReferencesModal.tsx](../../../frontend/src/components/app/editor/TurnReferencesModal.tsx) |
-| Keyboard handling approach | Per-modal `onKeyDown` handlers + `useModalKeys` hook |
-| Global keyboard system | Yes - `useGlobalHotkeys` and `ReferencesTabs` listeners |
-| Input field handling | `stopPropagation()` pattern used inconsistently |
-
----
-
-## 1. Modal/Dialog Component Library
-
-**Finding:** The project uses a **custom modal system** built on React Portals - no third-party dialog library.
-
-### Components
-
-| File | Purpose |
-|------|---------|
-| [ModalPortal.tsx](../../../frontend/src/components/modals/ModalPortal.tsx) | Portal wrapper rendering to `#modal-root` |
-| [InspectItemModal.tsx](../../../frontend/src/components/modals/InspectItemModal.tsx) | Read-only item inspection modal |
-| [TurnReferencesModal.tsx](../../../frontend/src/components/app/editor/TurnReferencesModal.tsx) | Reference management modal |
-| [TagsModal.tsx](../../../frontend/src/components/app/editor/TagsModal.tsx) | Tag management modal |
-
-### Portal Target
-
-```html
-<!-- frontend/index.html:12 -->
-<div id="modal-root"></div>
-```
-
----
-
-## 2. TurnReferencesModal Implementation
-
-**Location:** [frontend/src/components/app/editor/TurnReferencesModal.tsx](../../../frontend/src/components/app/editor/TurnReferencesModal.tsx)
-
-### Structure
-
-```tsx
-<ModalPortal>
-  <div className="fixed inset-0 z-50 ...">           {/* Backdrop */}
-    <button onClick={onClose} tabIndex={-1} />       {/* Backdrop close button */}
-    <div role="dialog" aria-modal="true" ...>        {/* Dialog container */}
-      {/* Header, content, footer */}
-    </div>
-  </div>
-</ModalPortal>
-```
-
-### Current Keyboard Handling (Lines 395-400)
-
-```tsx
-onKeyDown={(e) => {
-  // Allow Escape to close, but let other keys pass through
-  if (e.key === "Escape") {
-    e.stopPropagation();
-    onClose();
-  }
-}}
-```
-
-### Input Field Handler (Lines 442-447)
-
-```tsx
-<input
-  onKeyDown={(e) => {
-    if (e.key === "Enter") {
-      e.preventDefault();
-      handleSearchSubmit();
-    }
-  }}
-/>
-```
-
-**Issue:** Does NOT use `useModalKeys` hook or call `stopPropagation()` for the search input.
-
----
-
-## 3. Global Keyboard Shortcut System
-
-### useGlobalHotkeys Hook
-
-**Location:** [frontend/src/hooks/useGlobalHotkeys.ts](../../../frontend/src/hooks/useGlobalHotkeys.ts)
-
-```typescript
-// Handles: Cmd/Ctrl+S (save draft), Cmd/Ctrl+Enter (approve)
-// Checks isEditable before handling Enter
-window.addEventListener("keydown", onKeyDown);
-```
-
-### useModalKeys Hook
-
-**Location:** [frontend/src/hooks/useModalKeys.ts](../../../frontend/src/hooks/useModalKeys.ts)
-
-```typescript
-// Handles: Escape (close), Enter (confirm if not busy)
-// Checks isEditable before handling Enter
-// Used by InspectItemModal but NOT TurnReferencesModal
-```
-
-### ReferencesTabs Global Listener
-
-**Location:** [frontend/src/components/app/ReferencesPanel/ReferencesTabs.tsx](../../../frontend/src/components/app/ReferencesPanel/ReferencesTabs.tsx#L59-L76)
-
-```typescript
-// Handles: Cmd/Ctrl+1 (search tab), Cmd/Ctrl+2 (selected tab)
-// Checks isEditable before processing
-window.addEventListener("keydown", onKeyDown);
-```
-
----
-
-## 4. Input Field Focus and Event Handling
-
-### Pattern Analysis
-
-| Component | Pattern | Assessment |
-|-----------|---------|-------|
-| TagsModal | `onKeyDown={(e) => e.stopPropagation()}` on outer div | ✅ Prevents ALL key events from propagating |
-| TurnReferencesModal | Only stops propagation for Escape | ⚠️ Other keys may leak to global listeners |
-| InspectItemModal | Uses `useModalKeys` hook | ✅ Hook checks `isEditable` |
-
-### TagsModal Pattern (Best Practice Found)
-
-```tsx
-<div
-  onClick={(e) => e.stopPropagation()}
-  onKeyDown={(e) => e.stopPropagation()}  // Blocks all keyboard events
-  role="dialog"
-  aria-modal="true"
->
-```
-
-### TurnReferencesModal Pattern (Current)
-
-```tsx
-<div
-  onClick={(e) => e.stopPropagation()}
-  onKeyDown={(e) => {
-    if (e.key === "Escape") {  // Only Escape is handled
-      e.stopPropagation();
-      onClose();
-    }
-  }}
-  role="dialog"
->
-```
-
----
-
-## 5. Potential Issues Identified
-
-### Issue 1: Inconsistent `stopPropagation()` Usage
-
-- **TagsModal** blocks ALL keyboard events from propagating
-- **TurnReferencesModal** only blocks Escape - other keys like `Cmd+1`, `Cmd+2` may trigger `ReferencesTabs` tab switching
-
-### Issue 2: Missing `useModalKeys` Hook
-
-- **TurnReferencesModal** implements its own partial keyboard handling
-- **InspectItemModal** uses the standardized `useModalKeys` hook
-- This creates inconsistent behavior across modals
-
-### Issue 3: Global Listener Race Conditions
-
-Multiple global `keydown` listeners exist:
-1. `useGlobalHotkeys` (save/approve)
-2. `useModalKeys` (escape/enter)
-3. `ReferencesTabs` (tab switching)
-
-Each checks `isEditable` independently, and the handlers run in listener-registration order, which depends on component mount order and is not controlled explicitly.
-
----
-
-## 6. Recommendations
-
-1. **Standardize keyboard handling** - Update TurnReferencesModal to use `useModalKeys` hook
-2. **Block all events on modal container** - Add `onKeyDown={(e) => e.stopPropagation()}` to prevent global listener interference
-3. **Keep input-specific handlers** - Let Enter in search input trigger search, not modal close
-4. **Consider event delegation** - Centralize keyboard event handling to avoid race conditions
-
----
-
-## Files Referenced
-
-- [frontend/src/components/app/editor/TurnReferencesModal.tsx](../../../frontend/src/components/app/editor/TurnReferencesModal.tsx)
-- [frontend/src/components/app/editor/TagsModal.tsx](../../../frontend/src/components/app/editor/TagsModal.tsx)
-- [frontend/src/components/modals/ModalPortal.tsx](../../../frontend/src/components/modals/ModalPortal.tsx)
-- [frontend/src/components/modals/InspectItemModal.tsx](../../../frontend/src/components/modals/InspectItemModal.tsx)
-- [frontend/src/hooks/useModalKeys.ts](../../../frontend/src/hooks/useModalKeys.ts)
-- [frontend/src/hooks/useGlobalHotkeys.ts](../../../frontend/src/hooks/useGlobalHotkeys.ts)
-- [frontend/src/components/app/ReferencesPanel/ReferencesTabs.tsx](../../../frontend/src/components/app/ReferencesPanel/ReferencesTabs.tsx)
-- [frontend/index.html](../../../frontend/index.html) (line 12 - modal-root div)
diff --git a/.copilot-tracking/subagent/20260122/observability-operations-research.md b/.copilot-tracking/subagent/20260122/observability-operations-research.md
deleted file mode 100644
index a5a9e4a..0000000
--- a/.copilot-tracking/subagent/20260122/observability-operations-research.md
+++ /dev/null
@@ -1,52 +0,0 @@
----
-topic: observability-operations
-jtbd: JTBD-001
-date: 2026-01-22
-status: complete
----
-
-# Research: Observability and Operations
-
-## Context
-
-The observability and operations system provides opt-in telemetry, error handling, health endpoints, and demo-safe operation modes.
-
-## Sources Consulted
-
-### URLs
-- (None)
-
-### Codebase
-- [backend/app/main.py](backend/app/main.py): Defines `GET /healthz` endpoint returning repo/backend info.
-- [frontend/src/services/telemetry.ts](frontend/src/services/telemetry.ts): Implements opt-in telemetry with safe no-op behavior.
-
-### Documentation
-- [.copilot-tracking/research/20260121-high-level-requirements-research.md](.copilot-tracking/research/20260121-high-level-requirements-research.md): Consolidates observability requirements.
-- [frontend/docs/OBSERVABILITY_IMPLEMENTATION.md](frontend/docs/OBSERVABILITY_IMPLEMENTATION.md): Documents telemetry opt-in policy, error boundaries, and safe-by-default behavior.
-- [frontend/README.md](frontend/README.md): Describes demo mode and telemetry configuration.
-
-## Key Findings
-
-1. The backend exposes a `GET /healthz` endpoint that returns repository and backend status.
-2. Client telemetry is opt-in, disabled by default, and safe-by-default (no-op in demo mode or when configuration is missing).
-3. The UI provides an error boundary that catches rendering errors and shows a user-friendly fallback.
-4. Demo mode disables or safely no-ops telemetry and can use mock providers.
-5. Telemetry integration with Application Insights is available when configured.
-
-## Existing Patterns
-
-| Pattern | Location | Relevance |
-|---------|----------|-----------|
-| Health endpoint | [backend/app/main.py](backend/app/main.py) | Defines operational status check |
-| Opt-in telemetry with no-op fallback | [frontend/docs/OBSERVABILITY_IMPLEMENTATION.md](frontend/docs/OBSERVABILITY_IMPLEMENTATION.md) | Defines safe-by-default policy |
-| Error boundary | [frontend/docs/OBSERVABILITY_IMPLEMENTATION.md](frontend/docs/OBSERVABILITY_IMPLEMENTATION.md) | Defines graceful error handling UX |
-
-## Open Questions
-
-- (None)
-
-## Recommendations for Spec
-
-- Specify that the backend exposes a health endpoint at `GET /healthz`.
-- Specify that client telemetry is opt-in and safe-by-default.
-- Specify that the UI provides an error boundary for rendering failures.
diff --git a/.copilot-tracking/subagent/20260122/partial-updates-research.md b/.copilot-tracking/subagent/20260122/partial-updates-research.md
deleted file mode 100644
index 552d73f..0000000
--- a/.copilot-tracking/subagent/20260122/partial-updates-research.md
+++ /dev/null
@@ -1,275 +0,0 @@
-# Partial Updates Research: SA-244
-
-**Research Date:** 2026-01-22
-**Topic:** Cosmos DB Partial Document Updates (Patch Operations)
-**JTBD:** Help optimize GTC performance and Cosmos usage
-
----
-
-## Executive Summary
-
-The GTC codebase currently uses **full document replacement** (`replace_item`, `upsert_item`) for most updates, but already has **one working patch implementation** for assignment operations. Expanding partial updates to additional operations would reduce network bandwidth, improve latency, and potentially lower RU consumption for common update patterns.
-
----
-
-## Current Codebase Analysis
-
-### 1. Update Methods Currently Used
-
-| Method | Location | Usage |
-|--------|----------|-------|
-| `replace_item` | [cosmos_repo.py#L1113](backend/app/adapters/repos/cosmos_repo.py#L1113) | Main GT update with ETag |
-| `replace_item` | [cosmos_repo.py#L1158](backend/app/adapters/repos/cosmos_repo.py#L1158) | GT update in retry loop |
-| `replace_item` | [cosmos_repo.py#L1869](backend/app/adapters/repos/cosmos_repo.py#L1869) | Assignment fallback (emulator) |
-| `upsert_item` | [cosmos_repo.py#L1126](backend/app/adapters/repos/cosmos_repo.py#L1126) | Create-if-missing fallback |
-| `upsert_item` | [cosmos_repo.py#L1214](backend/app/adapters/repos/cosmos_repo.py#L1214) | Non-ETag updates |
-| `upsert_item` | [tags_repo.py#L140](backend/app/adapters/repos/tags_repo.py#L140) | Tags document updates |
-| **`patch_item`** | [cosmos_repo.py#L1784](backend/app/adapters/repos/cosmos_repo.py#L1784) | Assignment operations ✅ |
-
-### 2. Existing Patch Implementation
-
-The `assign_to` method at line 1784 already uses patch operations successfully:
-
-```python
-patch_operations = [
-    {"op": "set", "path": "/assignedTo", "value": user_id},
-    {"op": "set", "path": "/assignedAt", "value": now},
-    {"op": "set", "path": "/status", "value": GroundTruthStatus.draft.value},
-    {"op": "set", "path": "/updatedAt", "value": now},
-]
-
-await gt.patch_item(
-    item=item_id,
-    partition_key=partition_key,
-    patch_operations=patch_operations,
-    filter_predicate=filter_predicate,
-)
-```
-
-This demonstrates the pattern is already in production use with conditional updates.
-
-### 3. Main Update Operations in the Codebase
-
-| Operation | Fields Changed | Current Method | Patch Candidate? |
-|-----------|----------------|----------------|------------------|
-| SME assignment | `assignedTo`, `assignedAt`, `status`, `updatedAt` | `patch_item` ✅ | Already using patch |
-| Status change | `status`, `updatedAt` | `replace_item` | ✅ High priority |
-| Answer approval | `status`, `reviewedAt`, `updatedBy`, `assignedTo`, `assignedAt` | `upsert_gt` | ✅ High priority |
-| Edit answer | `answer`, `editedQuestion`, `comment`, `updatedAt` | `upsert_gt` | ✅ Medium priority |
-| Add/update refs | `refs`, `totalReferences`, `updatedAt` | `upsert_gt` | ⚠️ Complex (array operations) |
-| Update tags | `manualTags`, `updatedAt` | `upsert_gt` | ✅ Medium priority |
-| Update history | `history` | `upsert_gt` | ⚠️ Complex (nested arrays) |
-| Curation instructions | Full document | `upsert_item` | ❌ Usually full doc |
-| Global tags | `tags` array | `upsert_item` | ⚠️ Could use add/remove |
-
----
-
-## Azure Cosmos DB Patch API Capabilities
-
-### Supported Operations
-
-| Operation | Description | Use Case |
-|-----------|-------------|----------|
-| `set` | Set field value (creates if missing) | Status updates, field edits |
-| `add` | Add to array or create field | Adding tags, refs |
-| `replace` | Replace existing value (fails if missing) | Strict updates |
-| `remove` | Remove field or array element | Clearing assignments |
-| `incr` | Increment numeric field | Counters |
-| `move` | Move value between paths | Field migrations |
-
-### Key Limitations
-
-1. **Max 10 operations** per patch request
-2. **Item must exist** - patch_item fails if item not found (unlike upsert)
-3. **No parameterized filter predicates** - SQL injection risk requires careful escaping
-4. **System fields immutable** - Cannot patch `_id`, `_ts`, `_etag`, `_rid`
-5. **Emulator compatibility** - May need fallback path (as implemented for assign_to)
-
-### Python SDK Syntax
-
-```python
-# Single operation
-operations = [{"op": "set", "path": "/status", "value": "approved"}]
-
-# Multiple operations
-operations = [
-    {"op": "set", "path": "/status", "value": "approved"},
-    {"op": "set", "path": "/reviewedAt", "value": now},
-    {"op": "set", "path": "/assignedTo", "value": None},
-    {"op": "remove", "path": "/assignedAt"},
-]
-
-# With conditional predicate
-response = await container.patch_item(
-    item=item_id,
-    partition_key=partition_key,
-    patch_operations=operations,
-    filter_predicate="FROM c WHERE c.status = 'draft'",
-    etag=etag,
-    match_condition=MatchConditions.IfNotModified
-)
-```
-
----
-
-## RU Cost Analysis
-
-### Microsoft Documentation Findings
-
-From the [FAQ](https://learn.microsoft.com/en-us/azure/cosmos-db/partial-document-update-faq):
-
-> "Partial Document Update is normalized into request unit billing in the same way as other database operations. **Users shouldn't expect a significant reduction in RU.**"
-
-### Key Performance Benefits
-
-While RU cost may not dramatically decrease, partial updates provide:
-
-1. **Reduced Network Bandwidth** - Only changed fields transmitted
-2. **Lower End-to-End Latency** - Smaller payloads, faster processing
-3. **Atomic Conditional Updates** - Server-side filter predicates
-4. **Multi-Region Conflict Resolution** - Automatic path-level merging
-5. **Reduced Client CPU** - No read-modify-write cycle needed
-
-### Estimated Impact for GTC
-
-| Document Type | Typical Size | Fields Updated | Bandwidth Savings |
-|---------------|--------------|----------------|-------------------|
-| GroundTruthItem | 5-50 KB | 2-4 fields | 80-95% |
-| CurationInstructions | 1-5 KB | Full document | None |
-| Tags document | <1 KB | tags array | Minimal |
-| AssignmentDocument | <1 KB | Full document | Minimal |
-
-For large GroundTruthItems with extensive history/refs, the bandwidth savings could be significant.
-
----
-
-## Recommended Opportunities
-
-### Priority 1: Status/Assignment Updates (High Impact, Low Risk)
-
-**Target:** `upsert_gt` when only status-related fields change
-
-```python
-# New method: patch_status
-async def patch_status(
-    self, item_id: str, partition_key: list,
-    status: GroundTruthStatus,
-    assigned_to: str | None = None,
-    reviewed_at: datetime | None = None,
-    updated_by: str | None = None
-) -> bool:
-    now = datetime.now(timezone.utc).isoformat()
-    operations = [
-        {"op": "set", "path": "/status", "value": status.value},
-        {"op": "set", "path": "/updatedAt", "value": now},
-    ]
-    if assigned_to is not None:
-        operations.append({"op": "set", "path": "/assignedTo", "value": assigned_to})
-    # ... etc
-    return await self._patch_with_fallback(item_id, partition_key, operations)
-```
-
-**API Endpoints Affected:**
-- `PUT /v1/assignments/{dataset}/{bucket}/{item_id}` (approval)
-- `PUT /v1/ground-truths/{dataset}/{bucket}/{item_id}` (status change)
-
-### Priority 2: Field-Specific Updates (Medium Impact)
-
-**Target:** Single-field updates such as `editedQuestion`, `answer`, or `comment`
-
-```python
-async def patch_fields(
-    self, item_id: str, partition_key: list,
-    fields: dict[str, Any], etag: str | None = None
-) -> GroundTruthItem:
-    operations = [
-        {"op": "set", "path": f"/{k}", "value": v}
-        for k, v in fields.items()
-    ]
-    operations.append({"op": "set", "path": "/updatedAt", "value": datetime.now(timezone.utc).isoformat()})
-    # ...
-```
-
-### Priority 3: Tags Updates (Medium Impact)
-
-**Target:** `tags_repo.py` operations
-
-```python
-# Instead of read-modify-write:
-operations = [{"op": "add", "path": "/tags/-", "value": new_tag}]
-```
-
-### Lower Priority / Complex Cases
-
-- **References array** - Complex nested updates, may need full replacement
-- **History array** - Deep nesting with refs inside, likely needs full document
-- **Curation instructions** - Usually full document updates
-
----
-
-## Implementation Considerations
-
-### 1. Emulator Compatibility
-
-The existing `assign_to` implementation shows the pattern:
-- Try `patch_item` first
-- Fall back to read-modify-replace for emulator
-
-```python
-if self.is_cosmos_emulator_in_use():
-    return await self._assign_to_with_read_modify_replace(item_id, user_id)
-return await self._assign_to_with_patch(item_id, user_id)
-```
-
-### 2. ETag Handling
-
-Patch operations support ETag for optimistic concurrency:
-
-```python
-await container.patch_item(
-    item=item_id,
-    partition_key=pk,
-    patch_operations=ops,
-    etag=etag,
-    match_condition=MatchConditions.IfNotModified
-)
-```
-
-### 3. Error Handling
-
-- **412 Precondition Failed** - Filter predicate not satisfied, or ETag mismatch when `match_condition` is set
-- **404 Not Found** - Item doesn't exist (patch_item requires existence)
-- **400 Bad Request** - Invalid path or operation
-
-### 4. Testing Strategy
-
-1. Unit tests for patch operation building
-2. Integration tests against emulator (with fallback verification)
-3. Integration tests against live Cosmos (if available)
-
----
-
-## Conclusion
-
-The codebase already has a working patch implementation for assignments. Expanding this pattern to status updates and field-specific edits would:
-
-1. **Reduce network bandwidth** by 80-95% for large documents
-2. **Improve latency** for common update operations
-3. **Enable atomic conditional updates** without read-modify-write cycles
-4. **Simplify conflict resolution** in multi-region scenarios
-
-**Recommended next steps:**
-1. Extract common patch helper method from `assign_to`
-2. Implement `patch_status` for approval/status changes
-3. Implement `patch_fields` for targeted field updates
-4. Add comprehensive emulator fallback testing
-
----
-
-## References
-
-- [Partial document update in Azure Cosmos DB](https://learn.microsoft.com/en-us/azure/cosmos-db/partial-document-update)
-- [Get started with partial document update](https://learn.microsoft.com/en-us/azure/cosmos-db/partial-document-update-getting-started)
-- [Partial document update FAQ](https://learn.microsoft.com/en-us/azure/cosmos-db/partial-document-update-faq)
-- [Python SDK ContainerProxy.patch_item](https://learn.microsoft.com/en-us/python/api/azure-cosmos/azure.cosmos.containerproxy)
-- Existing implementation: [cosmos_repo.py#L1765-L1810](backend/app/adapters/repos/cosmos_repo.py#L1765)
diff --git a/.copilot-tracking/subagent/20260122/pii-detection-research.md b/.copilot-tracking/subagent/20260122/pii-detection-research.md
deleted file mode 100644
index d9ebdf2..0000000
--- a/.copilot-tracking/subagent/20260122/pii-detection-research.md
+++ /dev/null
@@ -1,379 +0,0 @@
-# PII Detection Research
-
-**Date:** 2026-01-22  
-**Story:** SA-669 - GTC Needs PII Check  
-**Status:** Research Complete
-
-## Executive Summary
-
-This document captures research findings for implementing PII detection in the Ground Truth Curator's bulk import flow. The feature should scan imported content for personally identifiable information (email addresses and phone numbers first) and warn users without blocking import.
-
----
-
-## 1. Current Import Flow Analysis
-
-### Bulk Import Endpoint
-
-**File:** [backend/app/api/v1/ground_truths.py](backend/app/api/v1/ground_truths.py#L54-L114)
-
-The `import_bulk()` endpoint processes ground truth items through these steps:
-
-```
-1. Receive items via POST /v1/ground-truths
-2. Generate IDs for items without one (randomname)
-3. Validate items via validate_bulk_items() ← CURRENT VALIDATION HOOK
-4. Filter invalid items, collect errors
-5. Optionally set approval metadata if approve=true
-6. Apply computed tags to each item
-7. Persist via container.repo.import_bulk_gt()
-8. Return ImportBulkResponse with imported count, errors, and uuids
-```
-
-### Current Validation Service
-
-**File:** [backend/app/services/validation_service.py](backend/app/services/validation_service.py)
-
-The validation service currently:
-
-- Validates manual tags against the tag registry
-- Returns a dict mapping item ID to list of validation errors
-- Uses async/concurrent validation for performance
-- Pre-fetches tag registry once for all items (efficiency pattern)
-
-**Key functions:**
-
-- `validate_ground_truth_item(item, valid_tags_cache)` - validates single item
-- `validate_bulk_items(items)` - validates list concurrently
-
-### Fields Containing Scannable Content
-
-From [backend/app/domain/models.py](backend/app/domain/models.py#L52-L120):
-
-| Field | Type | Description | PII Scan Priority |
-|-------|------|-------------|-------------------|
-| `synth_question` | str | Primary question text | **High** |
-| `edited_question` | str | User-edited question | **High** |
-| `answer` | str | Answer content | **High** |
-| `comment` | str | Curator notes | **High** |
-| `history[].msg` | str | Multi-turn messages | **High** |
-| `refs[].content` | str | Reference content | Medium |
-| `refs[].keyExcerpt` | str | Key excerpt text | Medium |
-| `contextUsedForGeneration` | str | Context source | Medium |
-
----
-
-## 2. Python PII Detection Libraries
-
-### Microsoft Presidio (Recommended)
-
-**Package:** `presidio-analyzer`  
-**Repository:** https://github.com/microsoft/presidio  
-**License:** MIT
-
-**Pros:**
-
-- Microsoft-maintained, enterprise-grade
-- Extensible recognizer architecture
-- Supports custom patterns and ML models
-- Good out-of-box support for email, phone, SSN, credit cards
-- Active maintenance and community
-
-**Cons:**
-
-- Heavier dependency footprint (spaCy optional but recommended)
-- Requires model downloads for best accuracy
-
-**Usage Example:**
-
-```python
-from presidio_analyzer import AnalyzerEngine
-
-analyzer = AnalyzerEngine()
-results = analyzer.analyze(
-    text="Contact john.doe@example.com or call 555-123-4567",
-    entities=["EMAIL_ADDRESS", "PHONE_NUMBER"],
-    language="en"
-)
-# Returns list of RecognizerResult with entity_type, start, end, score
-```
-
-### Scrubadub
-
-**Package:** `scrubadub`  
-**Repository:** https://github.com/datascopeanalytics/scrubadub
-
-**Pros:**
-
-- Lightweight, pure Python
-- Simple API
-- Good for basic patterns
-
-**Cons:**
-
-- Less actively maintained
-- Fewer entity types
-- Lower accuracy than Presidio
-
-### Regex-Only Approach
-
-For MVP/Phase 1, simple regex patterns could suffice:
-
-```python
-import re
-
-EMAIL_PATTERN = re.compile(
-    r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
-)
-PHONE_PATTERN = re.compile(
-    r'\b(?:\+?1[-.\s]?)?\(?[0-9]{3}\)?[-.\s]?[0-9]{3}[-.\s]?[0-9]{4}\b'
-)
-```
-
-**Pros:** Zero dependencies, fast, simple  
-**Cons:** Higher false positive/negative rates, harder to extend
-
-### Recommendation
-
-**Phase 1:** Start with regex patterns for email and phone (per story requirements)  
-**Phase 2:** Migrate to Presidio for broader PII coverage and better accuracy
-
----
-
-## 3. Patterns to Detect (Per SA-669)
-
-Story states: "Detection focuses on high-signal patterns first (email addresses and phone numbers)."
-
-### Phase 1 Patterns
-
-| Pattern | Regex | Examples |
-|---------|-------|----------|
-| Email | `[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}` | user@domain.com |
-| Phone (US) | `(?:\+?1[-.\s]?)?\(?[0-9]{3}\)?[-.\s]?[0-9]{3}[-.\s]?[0-9]{4}` | 555-123-4567, (555) 123-4567 |
-
-### Future Patterns (Phase 2+)
-
-- SSN: `\d{3}-\d{2}-\d{4}`
-- Credit card: Luhn-validated 16-digit numbers
-- IP addresses: `\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}`
-- Names (ML-based via Presidio)
-
----
-
-## 4. Warning Flow Design
-
-### Requirements from SA-669
-
-> "If potential PII is detected, the system warns the user but still allows the import to proceed."
-
-### Proposed Response Model
-
-Extend `ImportBulkResponse` to include PII warnings:
-
-```python
-class PIIWarning(BaseModel):
-    item_id: str
-    field: str
-    pattern_type: str  # "email", "phone", etc.
-    snippet: str  # Masked snippet showing context
-    position: int  # Character position in field
-
-class ImportBulkResponse(BaseModel):
-    imported: int
-    errors: list[str]
-    uuids: list[str]
-    pii_warnings: list[PIIWarning] = Field(default_factory=list)  # NEW
-```
-
-### Flow Diagram
-
-```
-POST /v1/ground-truths
-         │
-         ▼
-   Generate IDs
-         │
-         ▼
-   Validate Tags (existing)
-         │
-         ▼
-┌────────────────────────┐
-│   PII Detection (NEW)  │
-│  - Scan text fields    │
-│  - Collect warnings    │
-│  - Continue import     │
-└────────────────────────┘
-         │
-         ▼
-   Apply Computed Tags
-         │
-         ▼
-   Persist to Cosmos DB
-         │
-         ▼
-   Return Response with:
-   - imported count
-   - errors
-   - uuids
-   - pii_warnings ◄── NEW
-```
-
----
-
-## 5. Recommended Integration Points
-
-### Option A: Extend `validation_service.py`
-
-Add PII scanning alongside tag validation:
-
-```python
-# validation_service.py
-
-async def scan_for_pii(item: GroundTruthItem) -> list[PIIWarning]:
-    """Scan item content fields for PII patterns."""
-    warnings = []
-    fields_to_scan = [
-        ("synthQuestion", item.synth_question),
-        ("editedQuestion", item.edited_question),
-        ("answer", item.answer),
-        ("comment", item.comment),
-    ]
-    
-    # Also scan history messages
-    for idx, turn in enumerate(item.history or []):
-        fields_to_scan.append((f"history[{idx}].msg", turn.msg))
-    
-    for field_name, content in fields_to_scan:
-        if content:
-            warnings.extend(_detect_pii_in_text(item.id, field_name, content))
-    
-    return warnings
-
-async def validate_bulk_items_with_pii(
-    items: list[GroundTruthItem]
-) -> tuple[dict[str, list[str]], list[PIIWarning]]:
-    """Validate items and scan for PII."""
-    validation_errors = await validate_bulk_items(items)
-    
-    # Scan for PII concurrently
-    pii_tasks = [scan_for_pii(item) for item in items]
-    pii_results = await asyncio.gather(*pii_tasks)
-    
-    all_warnings = []
-    for warnings in pii_results:
-        all_warnings.extend(warnings)
-    
-    return validation_errors, all_warnings
-```
-
-### Option B: New `pii_service.py`
-
-Create a dedicated service (better separation of concerns):
-
-```python
-# app/services/pii_service.py
-
-class PIIDetectionService:
-    def __init__(self):
-        self._email_pattern = re.compile(...)
-        self._phone_pattern = re.compile(...)
-    
-    def scan_text(self, text: str) -> list[PIIMatch]:
-        """Scan text for PII patterns."""
-        ...
-    
-    async def scan_item(self, item: GroundTruthItem) -> list[PIIWarning]:
-        """Scan all text fields in a ground truth item."""
-        ...
-    
-    async def scan_bulk(self, items: list[GroundTruthItem]) -> list[PIIWarning]:
-        """Scan multiple items concurrently."""
-        ...
-```
-
-### Recommendation
-
-**Option B (new service)** is preferred because:
-
-1. Follows existing service patterns (see `tagging_service.py`, `search_service.py`)
-2. Easier to test in isolation
-3. Cleaner separation from tag validation concerns
-4. Easier to evolve (e.g., swap regex for Presidio later)
-
----
-
-## 6. Implementation Checklist
-
-### Backend Changes
-
-- [ ] Create `app/services/pii_service.py` with regex-based detection
-- [ ] Add `PIIWarning` model to `app/domain/models.py`
-- [ ] Extend `ImportBulkResponse` with `pii_warnings` field
-- [ ] Call PII service in `import_bulk()` endpoint
-- [ ] Add unit tests for PII detection patterns
-- [ ] Add integration tests for bulk import with PII warnings
-
-### Configuration
-
-- [ ] Add `PII_DETECTION_ENABLED` feature flag (default: True)
-- [ ] Add `PII_PATTERNS` config for enabled pattern types
-
-### Documentation
-
-- [ ] Update API docs with new response field
-- [ ] Document PII detection patterns and limitations
-
----
-
-## 7. Test Cases
-
-### Unit Tests
-
-```python
-async def test_detect_email_in_question():
-    item = GroundTruthItem(
-        synthQuestion="Contact support@company.com for help"
-    )
-    warnings = await scan_for_pii(item)  # scan_for_pii is a coroutine
-    assert len(warnings) == 1
-    assert warnings[0].pattern_type == "email"
-
-async def test_detect_phone_in_answer():
-    item = GroundTruthItem(
-        answer="Call us at 555-123-4567"
-    )
-    warnings = await scan_for_pii(item)
-    assert len(warnings) == 1
-    assert warnings[0].pattern_type == "phone"
-
-async def test_no_pii_returns_empty():
-    item = GroundTruthItem(
-        synthQuestion="How do I reset my password?"
-    )
-    warnings = await scan_for_pii(item)
-    assert len(warnings) == 0
-```
-
-### Integration Tests
-
-```python
-async def test_bulk_import_returns_pii_warnings(async_client):
-    payload = [{
-        "datasetName": "test",
-        "synthQuestion": "Email john@example.com for details"
-    }]
-    response = await async_client.post("/v1/ground-truths", json=payload)
-    assert response.status_code == 200
-    data = response.json()
-    assert data["imported"] == 1  # Import succeeds
-    assert len(data["pii_warnings"]) == 1  # Warning returned
-```
-
----
-
-## 8. References
-
-- **Story:** SA-669 - GTC Needs PII Check
-- **Presidio Docs:** https://microsoft.github.io/presidio/
-- **Existing Validation:** [backend/app/services/validation_service.py](backend/app/services/validation_service.py)
-- **Import Endpoint:** [backend/app/api/v1/ground_truths.py](backend/app/api/v1/ground_truths.py)
-- **Domain Models:** [backend/app/domain/models.py](backend/app/domain/models.py)
diff --git a/.copilot-tracking/subagent/20260122/query-optimization-research.md b/.copilot-tracking/subagent/20260122/query-optimization-research.md
deleted file mode 100644
index 41ed0d9..0000000
--- a/.copilot-tracking/subagent/20260122/query-optimization-research.md
+++ /dev/null
@@ -1,277 +0,0 @@
----
-topic: query-optimization
-jtbd: JTBD-008
-date: 2026-01-22
-status: complete
-stories: SA-247, SA-248
----
-
-# Research: Query Optimization
-
-## Context
-
-The query optimization effort replaces expensive cross-partition queries with efficient patterns. This research identifies all Cosmos DB queries in the GTC codebase, analyzes their partition key usage, and provides recommendations for optimization.
-
-## Sources Consulted
-
-### Codebase
-
-- [cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py): Main repository with all Cosmos DB queries
-- [config.py](backend/app/core/config.py): Configuration including pagination limits and `PAGINATION_TAG_FETCH_MAX`
-- [assignments.py](backend/app/api/v1/assignments.py): Assignment API endpoints
-- [tags_repo.py](backend/app/adapters/repos/tags_repo.py): Tags repository (uses point reads)
-
-### Documentation
-
-- [Optimize request cost in Azure Cosmos DB](https://learn.microsoft.com/en-us/azure/cosmos-db/optimize-cost-reads-writes): Point reads cost ~1 RU/KB, queries vary significantly
-- [Query an Azure Cosmos DB container](https://learn.microsoft.com/en-us/azure/cosmos-db/how-to-query-container): Cross-partition queries fan out to all physical partitions
-- [Partitioning and horizontal scaling](https://learn.microsoft.com/en-us/azure/cosmos-db/partitioning-overview): Partition key selection best practices
-
-## Key Findings
-
-### 1. Partition Key Strategy
-
-**Current Strategy**: MultiHash hierarchical key on `[/datasetName, /bucket]`
-
-```python
-# From cosmos_repo.py line 205 (docstring excerpt):
-# Partition key strategy: MultiHash hierarchical key on [/datasetName, /bucket].
-# The `bucket` field is a UUID and is stored as its string representation.
-```
-
-**Implications**:
-
-- Single-partition queries require BOTH `datasetName` AND `bucket` values
-- Queries filtering only by `datasetName` are still cross-partition (across buckets)
-- Queries without either filter scan ALL partitions
-
-### 2. The Arbitrary 200 Limit (SA-248)
-
-Found in multiple locations as `min(limit, 200)` or `min(take, 200)`:
-
-| Location | Line | Context |
-|----------|------|---------|
-| [cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L1405) | 1405 | `list_unassigned()`: `max_item_count=min(limit, 200)` |
-| [cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L1623) | 1623 | `_query_unassigned_by_selector()`: `max_item_count=min(take, 200)` |
-| [cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L1678) | 1678 | `_query_unassigned_global_excluding_user()`: `max_item_count=min(take, 200)` |
-
-**Issue**: This hardcoded 200 caps `max_item_count` (the per-page item count) for unassigned-item queries, but:
-
-- It's undocumented and appears arbitrary
-- The comment doesn't explain why 200 was chosen
-- May cause issues if more items are needed for fair sampling across datasets
-- Creates inconsistency with `PAGINATION_TAG_FETCH_MAX` (500) in config
-
-### 3. Cross-Partition Queries Identified
-
-**`enable_scan_in_query=True` appears 16+ times**, indicating cross-partition queries:
-
-| Category | Count | Notes |
-|----------|-------|-------|
-| Total cross-partition queries | 16 | All use `enable_scan_in_query=True` |
-| Single-partition operations | 2 | Assignment lookups use `enable_scan_in_query=False` |
-| Point reads | 3 | `read_item()` calls with full partition key |
-
-## Expensive Query Inventory
-
-| Query Location | Query Pattern | Issue | Recommendation |
-|----------------|---------------|-------|----------------|
-| [cosmos_repo.py:525](backend/app/adapters/repos/cosmos_repo.py#L525) | `list_all_gt()` - `SELECT * FROM c` with optional status filter | Full container scan, no partition key filter | Add pagination, consider batch processing |
-| [cosmos_repo.py:759](backend/app/adapters/repos/cosmos_repo.py#L759) | `list_gt_paginated()` - ORDER BY with OFFSET/LIMIT | Cross-partition with sort | Already optimized with server-side pagination |
-| [cosmos_repo.py:1350](backend/app/adapters/repos/cosmos_repo.py#L1350) | `stats()` - `SELECT c.status FROM c` | Full container scan for counts | Use Change Feed or materialized view |
-| [cosmos_repo.py:1375](backend/app/adapters/repos/cosmos_repo.py#L1375) | `list_datasets()` - `SELECT DISTINCT VALUE c.datasetName` | Full container scan | Cache results, use Change Feed |
-| [cosmos_repo.py:1404](backend/app/adapters/repos/cosmos_repo.py#L1404) | `list_unassigned()` - status filter only | Cross-partition, capped at 200 | Could use composite index |
-| [cosmos_repo.py:1742](backend/app/adapters/repos/cosmos_repo.py#L1742) | `assign_to()` - `SELECT TOP 1 ... WHERE c.id = @id` | Cross-partition lookup by ID only | **Should use point read if PK known** |
-| [cosmos_repo.py:1818](backend/app/adapters/repos/cosmos_repo.py#L1818) | `_assign_to_with_read_modify_replace()` - `SELECT TOP 1 * FROM c WHERE c.id = @id` | Cross-partition for emulator | Inherent emulator limitation |
-| [cosmos_repo.py:1896](backend/app/adapters/repos/cosmos_repo.py#L1896) | `list_assigned()` - filter by `assignedTo` | Cross-partition by user | Consider separate index or container |
-| [cosmos_repo.py:944](backend/app/adapters/repos/cosmos_repo.py#L944) | `_get_filtered_count()` - `SELECT VALUE COUNT(1)` | Cross-partition aggregation | Cache or use Change Feed |
-
-## Point Read Opportunities
-
-Per Microsoft documentation, a point read costs ~1 RU per KB, whereas an equivalent query can cost 3-10+ RU:
-
-| Current Pattern | Location | Optimization |
-|-----------------|----------|--------------|
-| Query by ID for assignment | Line 1742 | If `datasetName` and `bucket` are available, use `read_item()` |
-| Get item after upsert | Multiple | Already uses `get_gt()` with point read ✓ |
-
-**Already optimized**:
-
-- `get_gt()` (line 1058) - Uses `read_item()` with full partition key
-- `get_curation_instructions()` (line 1086) - Uses `read_item()`
-- Tags repo (line 121) - Uses `read_item()`
-
-## Arbitrary Limit Analysis (SA-248)
-
-### Current Behavior
-
-The 200 limit appears in three methods related to unassigned item sampling:
-
-```python
-# cosmos_repo.py line 1405
-max_item_count=min(limit, 200)
-
-# cosmos_repo.py line 1623
-max_item_count=min(take, 200)
-
-# cosmos_repo.py line 1678
-max_item_count=min(take, 200)
-```
-
-### Implications
-
-1. **Fairness**: When sampling across datasets with different sizes, the 200 cap may prevent fair distribution
-2. **Performance**: The limit exists to prevent runaway queries but lacks documentation
-3. **Inconsistency**: Config has `PAGINATION_TAG_FETCH_MAX=500` but these use hardcoded 200
-4. **No server-side continuation**: If more items are needed, the code breaks out of the loop rather than using continuation tokens
-
-### Recommendation
-
-1. Make the limit configurable via `Settings` (e.g., `SAMPLING_QUERY_MAX_ITEMS`)
-2. Document the rationale (RU budget, memory constraints, etc.)
-3. Consider using continuation tokens for larger sampling needs
-4. Align with `PAGINATION_TAG_FETCH_MAX` or document why they differ
-
-## Recommendations for Spec
-
-### High Priority
-
-1. **Replace ID-only queries with point reads** when partition key is available
-   - `assign_to()` queries by ID then patches; if caller provides dataset/bucket, use point read
-   - Estimated savings: ~2-5 RU per operation
-
-2. **Make the 200 limit configurable**
-   - Add `SAMPLING_QUERY_MAX_ITEMS` to config
-   - Document the tradeoff between RU cost and sampling fairness
-
-3. **Add composite indexes** for common query patterns:
-   - `(status, assignedTo)` for unassigned queries
-   - `(datasetName, status)` for dataset-scoped queries
-
-### Medium Priority
-
-4. **Cache `stats()` results** using Change Feed or time-based invalidation
-   - Currently scans entire container for 3 counts
-   - Could use materialized counters updated via Change Feed
-
-5. **Cache `list_datasets()` results**
-   - Dataset list changes infrequently
-   - Use TTL-based cache or invalidate on import
-
-6. **Use continuation tokens** in sampling methods instead of hard caps
-   - More robust for larger datasets
-   - Better RU efficiency with pagination
-
-### Low Priority
-
-7. **Consider secondary container** for assignment tracking
-   - Current cross-partition `list_assigned()` could be single-partition with PK=`userId`
-   - Already have `assignments` container but it duplicates data
-
-8. **Monitor RU consumption** per query type
-   - Add diagnostics logging for RU charges
-   - Identify optimization candidates based on actual usage
-
-## Query Efficiency Summary
-
-| Query Type | Count | Partition Efficiency | Action Needed |
-|------------|-------|---------------------|---------------|
-| Point reads | 3 | ✅ Single partition | None |
-| Single-partition queries | 2 | ✅ Single partition | None |
-| Cross-partition with filter | 10 | ⚠️ Partial | Add indexes |
-| Full container scans | 4 | ❌ All partitions | Cache or redesign |
-
-## RU Monitoring Status
-
-### Current State
-
-**No RU monitoring implemented.** The codebase does not capture or log Request Unit (RU) consumption from Cosmos DB queries.
-
-The observability implementation ([OBSERVABILITY_IMPLEMENTATION.md](backend/docs/OBSERVABILITY_IMPLEMENTATION.md)) uses OpenTelemetry with Azure Monitor but does not include Cosmos DB RU metrics.
-
-### Recommendation
-
-Add RU logging for expensive operations:
-
-```python
-async def _execute_query_with_metrics(
-    self, 
-    query: str, 
-    parameters: list, 
-    operation_name: str
-) -> tuple[list, float]:
-    """Execute query and log RU consumption."""
-    items = []
-    total_ru = 0.0
-    
-    iterator = self._gt_container.query_items(
-        query=query,
-        parameters=parameters,
-        enable_scan_in_query=True,
-    )
-    
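-    # Sum the RU charge page by page; the header value arrives as a string.
-    # Assumes the SDK exposes the latest headers on `client_connection`.
-    async for page in iterator.by_page():
-        async for item in page:
-            items.append(item)
-        headers = self._gt_container.client_connection.last_response_headers
-        total_ru += float(headers.get("x-ms-request-charge", 0))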
-    
-    self._logger.info(
-        "cosmos.query.metrics",
-        extra={
-            "operation": operation_name,
-            "ru_charge": total_ru,
-            "item_count": len(items),
-        }
-    )
-    
-    return items, total_ru
-```
-
-## Indexing Policy Analysis
-
-The current indexing policy ([indexing-policy.json](backend/scripts/indexing-policy.json)) includes composite indexes for common sort patterns but lacks optimization for assignment queries:
-
-**Current composite indexes**:
-- `reviewedAt` + `id` (both directions)
-- `updatedAt` + `id`
-- `status` + `reviewedAt` + `id`
-- `totalReferences` + `id` (both directions)
-- `status` + `totalReferences` + `id`
-
-**Recommended additions**:
-```json
-[
-    {"path": "/status", "order": "ascending"},
-    {"path": "/assignedTo", "order": "ascending"}
-]
-```
-
-This would optimize the `list_unassigned()` and `list_assigned()` queries that filter by status and assignedTo.
-
-## Implementation Priorities
-
-### Phase 1 (SA-248 - Immediate)
-1. Remove the hardcoded `min(limit, 200)` / `min(take, 200)` caps from sampling methods
-2. Add configurable `SAMPLING_QUERY_MAX_ITEMS` setting
-3. Use continuation tokens for proper pagination
-
-### Phase 2 (SA-247 - Short-term)
-1. Add RU logging for expensive queries
-2. Cache `stats()` and `list_datasets()` results
-3. Add composite index for `(status, assignedTo)`
-
-### Phase 3 (Future)
-1. Consider global secondary index for status-only queries
-2. Evaluate Change Feed for materialized views
-3. Implement automatic query analysis/alerting
-
-## References
-
-- [Azure Cosmos DB Query Optimization](https://learn.microsoft.com/en-us/azure/cosmos-db/how-to-query-container#avoid-cross-partition-queries)
-- [Partition Key Design Best Practices](https://learn.microsoft.com/en-us/azure/cosmos-db/partitioning-overview)
-- [Request Units in Azure Cosmos DB](https://learn.microsoft.com/en-us/azure/cosmos-db/request-units)
-- [Composite Indexes](https://learn.microsoft.com/en-us/azure/cosmos-db/index-policy#composite-indexes)
-- Codebase: [cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py), [config.py](backend/app/core/config.py), [indexing-policy.json](backend/scripts/indexing-policy.json)
diff --git a/.copilot-tracking/subagent/20260122/reference-identity-research.md b/.copilot-tracking/subagent/20260122/reference-identity-research.md
deleted file mode 100644
index ac3eab8..0000000
--- a/.copilot-tracking/subagent/20260122/reference-identity-research.md
+++ /dev/null
@@ -1,172 +0,0 @@
-# Reference Identity Research
-
-**Date:** 2026-01-22  
-**Topic:** Reference identity system using chunk ID from search index as primary uniqueness key
-
-## Executive Summary
-
-The current reference system uses **URL as the primary de-duplication key** in the frontend, with a secondary `id` field that is assigned at display time (sequential `ref_0`, `ref_1`, etc.) rather than from the search index. The search index **does provide a chunk ID** (via `chunk_id` field) from the inference adapter, but it is only partially propagated through the system.
-
-## Findings
-
-### 1. Current Reference Data Model
-
-#### Backend Model ([backend/app/domain/models.py](../../../backend/app/domain/models.py#L13-L35))
-
-```python
-class Reference(BaseModel):
-    url: str = Field(description="Reference URL (required, non-empty)")
-    title: str | None = Field(default=None)
-    content: str | None = None
-    keyExcerpt: str | None = None
-    type: str | None = None
-    bonus: bool = False
-    messageIndex: Optional[int] = None
-```
-
-**Key observation:** The backend `Reference` model has **no `id` field**. URL is the only required identifier.
-
-#### Frontend Model ([frontend/src/models/groundTruth.ts](../../../frontend/src/models/groundTruth.ts#L16-L27))
-
-```typescript
-export type Reference = {
-    id: string;           // Required in frontend
-    title?: string;
-    url: string;          // Required
-    snippet?: string;
-    visitedAt?: string | null;
-    keyParagraph?: string;
-    bonus?: boolean;
-    messageIndex?: number;
-};
-```
-
-**Key observation:** Frontend requires an `id` field, but this is **generated locally** and not persisted.
-
-### 2. URL-Based De-duplication Location
-
-#### Primary De-duplication: [frontend/src/models/gtHelpers.ts](../../../frontend/src/models/gtHelpers.ts#L33-L46)
-
-```typescript
-export function dedupeReferences(
-    existing: Reference[],
-    chosen: Reference[],
-): Reference[] {
-    const makeKey = (r: Reference) =>
-        r.messageIndex !== undefined ? `${r.url}::turn${r.messageIndex}` : r.url;
-    
-    const map = new Map(existing.map((r) => [makeKey(r), r] as const));
-    for (const r of chosen) {
-        const key = makeKey(r);
-        if (!map.has(key)) {
-            map.set(key, r);
-        }
-    }
-    return Array.from(map.values());
-}
-```
-
-**De-duplication key:** `URL` (or `URL::turnN` for multi-turn contexts)
-
-#### TurnReferencesModal duplicate check: [frontend/src/components/app/editor/TurnReferencesModal.tsx](../../../frontend/src/components/app/editor/TurnReferencesModal.tsx#L85)
-
-```typescript
-const urlsInTurn = new Set(turnRefs.map((r) => normalizeUrl(r.url)));
-```
-
-### 3. Backend Storage/Persistence
-
-References are stored as part of `GroundTruthItem` documents in Cosmos DB:
-
-- **Top-level refs:** `GroundTruthItem.refs: list[Reference]`
-- **Turn-level refs:** `GroundTruthItem.history[].refs: list[Reference]`
-
-The backend persists all reference fields **except** the frontend-only `id`. The Reference model validates that URL cannot be empty ([backend/app/domain/models.py](../../../backend/app/domain/models.py#L31-L34)).
-
-### 4. Search Index Fields - Chunk ID Availability
-
-#### Chat/Inference Adapter: [backend/app/adapters/gtc_inference_adapter.py](../../../backend/app/adapters/gtc_inference_adapter.py#L103-L128)
-
-```python
-def _extract_references(self, calls: list[dict[str, Any]]) -> list[dict[str, Any]]:
-    for call in calls:
-        results = call.get("results", [])
-        for doc in results:
-            ref = {
-                "id": doc.get("chunk_id") or doc.get("id"),  # ✅ chunk_id IS available
-                "title": doc.get("title"),
-                "url": doc.get("url"),
-                "snippet": doc.get("content"),
-            }
-            references.append(ref)
-```
-
-**The chunk ID is extracted from search results** as `chunk_id` (preferred) or `id` (fallback).
-
-#### Azure AI Search Tool Processing: [backend/app/adapters/inference/inference.py](../../../backend/app/adapters/inference/inference.py#L419)
-
-```python
-call["results"].append({"title": titles[i], "url": urls[i], "chunk_id": ids[i]})
-```
-
-**The search index provides:** `titles[]`, `urls[]`, `ids[]` (chunk IDs) in the metadata.
-
-#### Frontend Search Service: [frontend/src/services/search.ts](../../../frontend/src/services/search.ts#L24-L53)
-
-```typescript
-function mapWireToReference(x: SearchResultWire): Reference | null {
-    // ...
-    let id: string = randId("ref");  // Default: random ID
-    if (typeof o.id === "string" && o.id) id = o.id;  // Use provided ID if available
-    else if (doc && typeof doc.id === "string") id = doc.id as string;
-    return { id, title, url, snippet, visitedAt: null, keyParagraph: "" };
-}
-```
-
-**Current behavior:** Uses the ID from search results if available, but falls back to random ID.
-
-### 5. Downstream Systems Affected by Identity Key Change
-
-| System | Current Usage | Impact of Change |
-|--------|---------------|------------------|
-| **De-duplication** ([gtHelpers.ts](../../../frontend/src/models/gtHelpers.ts)) | Uses URL | Must switch to chunk ID |
-| **Reference Updates** ([useReferencesEditor.ts](../../../frontend/src/hooks/useReferencesEditor.ts)) | Uses `ref.id` for patch targeting | Would use chunk ID instead |
-| **Export Pipeline** ([backend/app/exports/pipeline.py](../../../backend/app/exports/pipeline.py)) | Outputs refs with URL as key field | May need to include chunk ID |
-| **API Ground Truth Mapping** ([groundTruths.ts](../../../frontend/src/services/groundTruths.ts#L63-L100)) | Generates sequential `ref_N` IDs | Would need to preserve chunk ID from storage |
-| **Turn References Modal** ([TurnReferencesModal.tsx](../../../frontend/src/components/app/editor/TurnReferencesModal.tsx)) | Checks URL for duplicates | Would check chunk ID |
-| **SelectedTab** ([SelectedTab.tsx](../../../frontend/src/components/app/ReferencesPanel/SelectedTab.tsx)) | Displays and manages by `ref.id` | Unchanged (uses existing id field) |
-
-### 6. Gap Analysis
-
-| Layer | Current State | Required for Chunk ID Identity |
-|-------|--------------|-------------------------------|
-| **Search Index** | ✅ Provides `chunk_id` | No change needed |
-| **Inference Adapter** | ✅ Extracts `chunk_id` as `id` | No change needed |
-| **Backend Reference Model** | ❌ No `id` field | Add optional `id` field |
-| **Frontend Search Service** | ⚠️ Uses `id` if present, fallback to random | Ensure consistent propagation |
-| **API Mapping** | ❌ Generates sequential IDs | Preserve chunk ID from storage |
-| **De-duplication** | ❌ Uses URL | Switch to chunk ID |
-| **Backend Persistence** | ❌ Doesn't store `id` | Store chunk ID in Reference |
-
-## Recommendations
-
-1. **Add `id` field to backend Reference model** (optional, string)
-2. **Persist chunk ID** when saving references from chat/search
-3. **Update de-duplication logic** to use `id` (chunk ID) instead of URL
-4. **Update API mapping** to preserve stored chunk ID instead of generating sequential IDs
-5. **Maintain URL as fallback** for legacy data without chunk IDs
-
-## Files Referenced
-
-- [backend/app/domain/models.py](../../../backend/app/domain/models.py) - Backend Reference model
-- [frontend/src/models/groundTruth.ts](../../../frontend/src/models/groundTruth.ts) - Frontend Reference type
-- [frontend/src/models/gtHelpers.ts](../../../frontend/src/models/gtHelpers.ts) - De-duplication logic
-- [frontend/src/services/search.ts](../../../frontend/src/services/search.ts) - Search result mapping
-- [frontend/src/services/groundTruths.ts](../../../frontend/src/services/groundTruths.ts) - API-to-frontend mapping
-- [frontend/src/hooks/useReferencesEditor.ts](../../../frontend/src/hooks/useReferencesEditor.ts) - Reference editing hook
-- [frontend/src/components/app/editor/TurnReferencesModal.tsx](../../../frontend/src/components/app/editor/TurnReferencesModal.tsx) - Turn references UI
-- [frontend/src/components/app/ReferencesPanel/SelectedTab.tsx](../../../frontend/src/components/app/ReferencesPanel/SelectedTab.tsx) - Selected references UI
-- [backend/app/adapters/gtc_inference_adapter.py](../../../backend/app/adapters/gtc_inference_adapter.py) - Inference adapter
-- [backend/app/adapters/inference/inference.py](../../../backend/app/adapters/inference/inference.py) - Azure AI Search processing
-- [backend/app/exports/pipeline.py](../../../backend/app/exports/pipeline.py) - Export pipeline
-- [specs/reference-management.md](../../../specs/reference-management.md) - Reference management spec
diff --git a/.copilot-tracking/subagent/20260122/reference-management-research.md b/.copilot-tracking/subagent/20260122/reference-management-research.md
deleted file mode 100644
index ee3cc9a..0000000
--- a/.copilot-tracking/subagent/20260122/reference-management-research.md
+++ /dev/null
@@ -1,50 +0,0 @@
----
-topic: reference-management
-jtbd: JTBD-001
-date: 2026-01-22
-status: complete
----
-
-# Research: Reference Management
-
-## Context
-
-The reference management system supports adding, visiting, annotating, and removing supporting references that back ground-truth items.
-
-## Sources Consulted
-
-### URLs
-- (None)
-
-### Codebase
-- [frontend/src/services/groundTruths.ts](frontend/src/services/groundTruths.ts): Maps top-level and per-history references into a unified reference list with `id`, `title`, `url`, `snippet`, `keyParagraph`, `visitedAt`, `bonus`, and `messageIndex`.
-- [frontend/CODEBASE.md](frontend/CODEBASE.md): Documents reference workflow behaviors including search, URL de-duplication, visited tracking, and key-paragraph editing.
-
-### Documentation
-- [.copilot-tracking/research/20260121-high-level-requirements-research.md](.copilot-tracking/research/20260121-high-level-requirements-research.md): Consolidates reference-related requirements and notes documentation gaps.
-- [frontend/src/components/app/defaultCurateInstructions.md](frontend/src/components/app/defaultCurateInstructions.md): Contains user-facing curation instructions including key paragraph constraints.
-
-## Key Findings
-
-1. References include a `keyParagraph` field with a minimum length constraint (≥40 characters) for approval eligibility.
-2. The UI tracks whether a reference has been visited (opened in a new tab) and uses this for approval gating.
-3. URL de-duplication is performed in the UI to prevent duplicate references.
-4. The frontend model unifies top-level `refs` and per-history `refs` into one reference list.
-5. References can be marked as "bonus" and can be associated with specific conversation turns via `messageIndex`.
-
-## Existing Patterns
-
-| Pattern | Location | Relevance |
-|---------|----------|-----------|
-| Reference mapping and normalization | [frontend/src/services/groundTruths.ts](frontend/src/services/groundTruths.ts) | Defines current shape of reference objects in UI |
-| Approval gating on reference completeness | [frontend/CODEBASE.md](frontend/CODEBASE.md) | Defines behavioral constraints for saving/approving |
-
-## Open Questions
-
-- (None)
-
-## Recommendations for Spec
-
-- Specify the reference data shape (id, title, url, snippet, keyParagraph, visitedAt, bonus, messageIndex).
-- Specify the approval gating rules: at least one selected reference, all visited, keyParagraph ≥40 chars.
-- Specify URL de-duplication as a UI behavior.
diff --git a/.copilot-tracking/subagent/20260122/tag-filtering-research.md b/.copilot-tracking/subagent/20260122/tag-filtering-research.md
deleted file mode 100644
index 42f4d62..0000000
--- a/.copilot-tracking/subagent/20260122/tag-filtering-research.md
+++ /dev/null
@@ -1,240 +0,0 @@
-# Research: Tag Filtering System
-
-**Topic:** tag-filtering  
-**Date:** 2026-01-22  
-**Status:** Complete
-
-## Summary
-
-The tag filtering system in Ground Truth Curator allows users to filter items by tags in the Explorer view. Currently, the system supports **include-only** filtering with AND logic. A planned enhancement (SA-363) will add tri-state selection (include/exclude/neutral) and boolean logic for advanced filtering.
-
-## Key Findings
-
-### 1. Current Explorer Tag Filter UI
-
-**Location:** [frontend/src/components/app/QuestionsExplorer.tsx](frontend/src/components/app/QuestionsExplorer.tsx)
-
-The Explorer component maintains tag filter state:
-
-```typescript
-// Filter state (unapplied)
-const [selectedTags, setSelectedTags] = useState<string[]>([]);
-
-// Applied filter state
-const [appliedFilter, setAppliedFilter] = useState<FilterState>({
-  status: "all",
-  dataset: "all",
-  tags: [],
-  // ...
-});
-```
-
-**Current UI behavior:**
-- Tags are displayed in a collapsible section "Filter by Tags"
-- Manual tags and computed tags are shown separately (manual in violet, computed in slate with lock icon)
-- Clicking a tag toggles it between selected (include) and unselected (neutral)
-- Selected tags show a badge count and "Clear all" option
-- Multiple selected tags use **AND logic** ("items must have ALL selected tags")
-- Tags fetched via `fetchTagsWithComputed()` which returns `{ manualTags: string[], computedTags: string[] }`
-
-### 2. Tag State Management
-
-**Tag toggle function:**
-```typescript
-const handleTagToggle = (tag: string) => {
-  setSelectedTags((prev) =>
-    prev.includes(tag) ? prev.filter((t) => t !== tag) : [...prev, tag],
-  );
-};
-```
-
-**Current limitation:** Binary state only (selected vs unselected) - no exclusion state.
-
-### 3. Tag-Related Fields on Ground Truth Items
-
-**Location:** [backend/app/domain/models.py](backend/app/domain/models.py#L76-L86)
-
-```python
-class GroundTruthItem(BaseModel):
-    # Tag fields: manualTags are user-provided, computedTags are system-generated
-    manual_tags: list[str] = Field(default_factory=list, alias="manualTags")
-    computed_tags: list[str] = Field(default_factory=list, alias="computedTags")
-
-    @computed_field
-    @property
-    def tags(self) -> list[str]:
-        """Return a merged, sorted view of manual and computed tags."""
-        merged = set(self.manual_tags or []) | set(self.computed_tags or [])
-        return sorted(merged)
-```
-
-**Key points:**
-- `manualTags`: User-applied tags (editable)
-- `computedTags`: System-generated tags from plugins (read-only)
-- `tags`: Computed property merging both (for backward compatibility)
-
-### 4. Backend API Tag Filtering
-
-**Location:** [backend/app/api/v1/ground_truths.py](backend/app/api/v1/ground_truths.py#L162-L230)
-
-The `list_all_ground_truths` endpoint accepts tags as a comma-separated string:
-
-```python
-@router.get("", response_model=GroundTruthListResponse)
-async def list_all_ground_truths(
-    tags: str | None = Query(default=None),
-    # ...
-):
-    # Tag validation
-    MAX_TAGS_PER_QUERY = 10
-    MAX_TAG_LENGTH = 100
-    
-    if tags is not None:
-        raw_tags = [tag.strip() for tag in tags.split(",")]
-        cleaned = [tag for tag in raw_tags if tag]
-        # Validation checks...
-        tag_list = cleaned if cleaned else None
-```
-
-**Frontend sends tags:**
-```typescript
-// In groundTruths.ts
-if (params.tags?.length) query.tags = params.tags.join(",");
-```
-
-### 5. Cosmos DB Query for Tag Filtering
-
-**Location:** [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py#L562-L571)
-
-```python
-def _build_query_filter(self, ..., tags: list[str] | None, ...):
-    if include_tags and tags:
-        normalized = [tag for tag in (tag.strip() for tag in tags) if tag]
-        for idx, tag in enumerate(normalized):
-            pname = f"@tag{idx}"
-            # Search across manualTags and computedTags
-            clauses.append(
-                f"(ARRAY_CONTAINS(c.manualTags, {pname}) OR "
-                f"ARRAY_CONTAINS(c.computedTags, {pname}))"
-            )
-            params.append({"name": pname, "value": tag})
-```
-
-**Current query pattern:**
-- Each tag becomes an AND clause
-- Searches both `manualTags` and `computedTags` arrays
-- Uses `ARRAY_CONTAINS()` function (not supported in Cosmos Emulator)
-
-### 6. Emulator Limitation
-
-**Location:** [backend/docs/cosmos-emulator-limitations.md](backend/docs/cosmos-emulator-limitations.md)
-
-> `ARRAY_CONTAINS SQL Function Not Supported` - Tag filtering tests must be skipped on emulator and run against real Cosmos DB.
-
-## Current Filter Capabilities
-
-| Capability | Status | Notes |
-|------------|--------|-------|
-| Include tags (AND) | ✅ Supported | Items must have ALL selected tags |
-| Exclude tags (NOT) | ❌ Not supported | Planned in SA-363 |
-| OR logic | ❌ Not supported | Planned in SA-363 |
-| Boolean expressions | ❌ Not supported | Planned in SA-363 |
-| Manual tags | ✅ Supported | Violet styling in UI |
-| Computed tags | ✅ Supported | Slate styling with lock icon |
-
-## Patterns Supporting Tri-State Selection (SA-363)
-
-### Frontend Changes Needed
-
-1. **State structure change:**
-```typescript
-// Current: string[] (selected tags)
-// Proposed: Map<string, 'include' | 'exclude'> or similar
-interface TagFilterState {
-  include: string[];
-  exclude: string[];
-}
-```
-
-2. **UI toggle pattern:**
-- Click 1: Neutral → Include (checkmark)
-- Click 2: Include → Exclude (X indicator)
-- Click 3: Exclude → Neutral (cleared)
-
-3. **Query parameter format:**
-```typescript
-// Option A: Separate params
-tags=tag1,tag2&excludeTags=tag3,tag4
-
-// Option B: Prefixed syntax
-tags=+tag1,+tag2,-tag3,-tag4
-```
-
-### Backend Changes Needed
-
-1. **API parameter changes:**
-```python
-@router.get("")
-async def list_all_ground_truths(
-    tags: str | None = Query(default=None),  # Include tags
-    exclude_tags: str | None = Query(default=None, alias="excludeTags"),  # New
-):
-```
-
-2. **Cosmos query for exclusion:**
-```python
-# NOT ARRAY_CONTAINS pattern
-for idx, tag in enumerate(excluded_tags):
-    pname = f"@excludeTag{idx}"
-    clauses.append(
-        f"NOT (ARRAY_CONTAINS(c.manualTags, {pname}) OR "
-        f"ARRAY_CONTAINS(c.computedTags, {pname}))")
-    params.append({"name": pname, "value": tag})  # bind the value, as for include tags
-```
-
-### Advanced Boolean Logic (SA-363)
-
-The PRD specifies support for:
-```
-has frequency:common AND NOT(has difficulty:easy)
-```
-
-This would require:
-1. A query DSL parser on the backend
-2. Translation to Cosmos SQL WHERE clauses
-3. Frontend text input with validation
-
-## Recommendations for Implementation
-
-1. **Phase 1: Tri-state UI**
-   - Update `selectedTags` to `tagFilters: Map<string, 'include' | 'exclude'>`
-   - Add visual indicators for include/exclude states
-   - Implement three-click toggle pattern
-
-2. **Phase 2: Backend exclude support**
-   - Add `excludeTags` query parameter
-   - Update `_build_query_filter()` with NOT clauses
-   - Add integration tests (requires real Cosmos DB)
-
-3. **Phase 3: Boolean query input (optional)**
-   - Add text input for advanced queries
-   - Implement parser with AND/OR/NOT/parentheses
-   - Add validation and error display
-
-## Related Files
-
-| File | Purpose |
-|------|---------|
-| [frontend/src/components/app/QuestionsExplorer.tsx](frontend/src/components/app/QuestionsExplorer.tsx) | Explorer UI with tag filter |
-| [frontend/src/services/tags.ts](frontend/src/services/tags.ts) | Tag fetching and validation |
-| [frontend/src/services/groundTruths.ts](frontend/src/services/groundTruths.ts) | API calls with tag params |
-| [backend/app/api/v1/ground_truths.py](backend/app/api/v1/ground_truths.py) | List endpoint with tag filtering |
-| [backend/app/adapters/repos/cosmos_repo.py](backend/app/adapters/repos/cosmos_repo.py) | Cosmos queries with ARRAY_CONTAINS |
-| [backend/app/domain/models.py](backend/app/domain/models.py) | GroundTruthItem with tag fields |
-| [prd-refined-2.json](prd-refined-2.json) | SA-363 requirements for tri-state |
-
-## Open Questions
-
-1. Should the URL encoding for tag filters use separate params or a prefix syntax?
-2. How to handle emulator limitations for exclude queries in development?
-3. Should the boolean query input be a separate mode or integrated with chip selection?
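
The include and exclude clause patterns in the notes above can be combined into one helper. This is a minimal sketch, not the repository's `_build_query_filter()`: the function name `build_tag_clauses` and its return shape are assumptions made for illustration, and how Cosmos DB treats items that lack one of the tag arrays under `NOT` is not verified here.

```python
from typing import Any, Optional


def build_tag_clauses(
    include: Optional[list[str]], exclude: Optional[list[str]]
) -> tuple[list[str], list[dict[str, Any]]]:
    """Return (clauses, params) for include and exclude tag filters.

    Hypothetical helper: each clause is meant to be joined with AND by the caller.
    """
    clauses: list[str] = []
    params: list[dict[str, Any]] = []

    def normalize(tags: Optional[list[str]]) -> list[str]:
        # Trim whitespace and drop empty entries, mirroring the include path.
        return [t for t in (tag.strip() for tag in tags or []) if t]

    for idx, tag in enumerate(normalize(include)):
        pname = f"@tag{idx}"
        clauses.append(
            f"(ARRAY_CONTAINS(c.manualTags, {pname}) OR "
            f"ARRAY_CONTAINS(c.computedTags, {pname}))"
        )
        params.append({"name": pname, "value": tag})

    for idx, tag in enumerate(normalize(exclude)):
        pname = f"@excludeTag{idx}"
        clauses.append(
            f"NOT (ARRAY_CONTAINS(c.manualTags, {pname}) OR "
            f"ARRAY_CONTAINS(c.computedTags, {pname}))"
        )
        # Every placeholder used in a clause needs a matching parameter entry.
        params.append({"name": pname, "value": tag})

    return clauses, params
```

Keeping include and exclude parameters under distinct name prefixes avoids collisions when the same index appears in both lists.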
diff --git a/.copilot-tracking/subagent/20260122/tag-glossary-research.md b/.copilot-tracking/subagent/20260122/tag-glossary-research.md
deleted file mode 100644
index 495ec3e..0000000
--- a/.copilot-tracking/subagent/20260122/tag-glossary-research.md
+++ /dev/null
@@ -1,209 +0,0 @@
-# Tag Glossary Research
-
-**Date:** 2026-01-22
-**Topic:** tag-glossary
-
-## Summary
-
-The system has a well-designed tag architecture with clear separation between **manual tags** (user-editable) and **computed tags** (auto-generated). However, there is **no existing infrastructure for tag definitions, descriptions, or a glossary**. Tags are stored as simple strings without metadata.
-
----
-
-## 1. Current Tag System Overview
-
-### Manual Tags
-
-Manual tags are user-editable and stored in the `manualTags` field on ground truth items. Default manual tags are configured via JSON file.
-
-**Location:** [backend/app/domain/manual_tags.json](backend/app/domain/manual_tags.json)
-
-**Current Manual Tag Groups:**
-
-| Group | Exclusive | Values |
-|-------|-----------|--------|
-| source | Yes | sme, sa, synthetic, sme_curated, user, other |
-| answerability | Yes | answerable, not_answerable, should_not_answer |
-| topic | No | general, compatibility, install, license, performance, security |
-| intent | No | informational, action, feedback, clarification, other |
-| expertise | Yes | expert, novice |
-| difficulty | Yes | easy, medium, hard |
-
-### Computed Tags
-
-Computed tags are auto-generated by plugins and stored in `computedTags` field. They are read-only in the UI.
-
-**Plugin Location:** [backend/app/plugins/computed_tags/](backend/app/plugins/computed_tags/)
-
-**Current Computed Tag Plugins:**
-
-| Plugin | Tag Key | Description (from code comments) |
-|--------|---------|----------------------------------|
-| DatasetPlugin | `dataset:_dynamic` | Tags documents with their dataset name |
-| QuestionLengthShortPlugin | `question_length:short` | Questions with ≤10 words |
-| QuestionLengthMediumPlugin | `question_length:medium` | Questions with 11-30 words |
-| QuestionLengthLongPlugin | `question_length:long` | Questions with >30 words |
-| SingleTurnPlugin | `turns:singleturn` | Documents with no/minimal history |
-| MultiTurnPlugin | `turns:multiturn` | Documents with >2 history turns |
-| NoAnswerPlugin | `answer:no_answer` | Ground truth answer is "NO_ANSWER" |
-| RetrievalBehaviorNoRefsPlugin | `retrieval_behavior:no_refs` | Zero references |
-| RetrievalBehaviorSinglePlugin | `retrieval_behavior:single` | Exactly one reference |
-| RetrievalBehaviorTwoRefsPlugin | `retrieval_behavior:two_refs` | Exactly two references |
-| RetrievalBehaviorRichPlugin | `retrieval_behavior:rich` | Three or more references |
-| ReferenceTypeArticlePlugin | `reference_type:article` | Contains CS# pattern URL |
-| ReferenceTypeHelpcenterPlugin | `reference_type:helpcenter` | Contains /help URL |
-
----
-
-## 2. Tag Definition Storage
-
-### Current State
-
-- **Manual Tags:** Stored in JSON config ([manual_tags.json](backend/app/domain/manual_tags.json)) with only `group`, `tags`, and `mutuallyExclusive` fields
-- **Computed Tags:** Defined in Python plugin classes with descriptions only in docstrings
-- **No description/definition field** exists in any tag model
-- **No glossary endpoint** or UI exists
-
-### Key Files
-
-| Purpose | Location |
-|---------|----------|
-| Manual tag config | [backend/app/domain/manual_tags.json](backend/app/domain/manual_tags.json) |
-| Manual tag provider | [backend/app/domain/manual_tags_provider.py](backend/app/domain/manual_tags_provider.py) |
-| Tag schema & rules | [backend/app/domain/tags.py](backend/app/domain/tags.py) |
-| Tag API endpoints | [backend/app/api/v1/tags.py](backend/app/api/v1/tags.py) |
-| Computed tag base | [backend/app/plugins/base.py](backend/app/plugins/base.py) |
-| Plugin registry | [backend/app/plugins/registry.py](backend/app/plugins/registry.py) |
-
----
-
-## 3. UI Tag Display
-
-### Components
-
-| Component | Location | Purpose |
-|-----------|----------|---------|
-| TagChip | [frontend/src/components/common/TagChip.tsx](frontend/src/components/common/TagChip.tsx) | Display individual tag with computed vs manual styling |
-| TagsEditor | [frontend/src/components/app/editor/TagsEditor.tsx](frontend/src/components/app/editor/TagsEditor.tsx) | Add/remove manual tags, display computed tags (read-only) |
-| InspectItemModal | [frontend/src/components/modals/InspectItemModal.tsx](frontend/src/components/modals/InspectItemModal.tsx) | Shows tags in item inspection view |
-
-### Tag Service
-
-[frontend/src/services/tags.ts](frontend/src/services/tags.ts) provides:
-- `fetchTagSchema()` - Get tag groups with exclusive rules
-- `fetchTagsWithComputed()` - Get manual and computed tags separately
-- `validateExclusiveTags()` - Validate exclusive group rules
-- `addTags()` - Add new manual tags to global registry
-
-### Current UI Behavior
-
-1. **Computed tags:** Displayed with lock icon and slate color scheme; read-only
-2. **Manual tags:** Displayed with violet color scheme; removable with X button
-3. **TagsEditor:** Shows "Auto-generated" label for computed tags section
-4. **No tooltips or definitions** are displayed for any tags
-
----
-
-## 4. API Endpoints
-
-| Endpoint | Purpose |
-|----------|---------|
-| `GET /v1/tags/schema` | Returns tag groups with values and exclusive rules |
-| `GET /v1/tags` | Returns `{ tags: [...], computedTags: [...] }` |
-| `POST /v1/tags` | Add tags to global registry |
-| `DELETE /v1/tags` | Remove tags from global registry |
-
-### Schema Response Shape
-
-```typescript
-interface TagSchemaResponse {
-  version: string;  // "v1"
-  groups: Array<{
-    name: string;
-    values: string[];
-    exclusive: boolean;
-    depends_on: Array<{ group: string; value: string }>;
-  }>;
-}
-```
-
-**Note:** No `description` field exists in the schema response.
-
----
-
-## 5. Gaps for Glossary Feature
-
-### Missing Infrastructure
-
-| Gap | Description | Impact |
-|-----|-------------|--------|
-| No description field in manual tag config | JSON only has group/tags/exclusive | Cannot store manual tag definitions |
-| No metadata in ComputedTagPlugin | Only `tag_key` and `compute()` | Computed tag descriptions only in docstrings |
-| No glossary API endpoint | No way to fetch all tag definitions | Frontend cannot display definitions |
-| No UI for viewing definitions | Tags displayed without context | Users don't know what tags mean |
-| No UI for managing definitions | No admin interface | Definitions cannot be edited |
-
-### Required Changes
-
-#### Backend
-
-1. **Extend manual tag JSON schema:**
-   ```json
-   {
-     "group": "source",
-     "description": "Where the ground truth originated",
-     "tags": [
-       { "value": "sme", "description": "Created by subject matter expert" },
-       { "value": "synthetic", "description": "AI-generated content" }
-     ]
-   }
-   ```
-
-2. **Add metadata to ComputedTagPlugin:**
-   ```python
-   class ComputedTagPlugin(ABC):
-       @property
-       @abstractmethod
-       def tag_key(self) -> str: ...
-
-       @property
-       @abstractmethod
-       def description(self) -> str:
-           """Human-readable description for glossary."""
-           ...
-   ```
-
-3. **Create glossary API endpoint:**
-   - `GET /v1/tags/glossary` returning all tags with definitions
-   - Merge manual tag definitions with computed tag descriptions
-
-#### Frontend
-
-1. **TagChip enhancement:** Add tooltip with tag definition on hover
-2. **Glossary component:** Full-page or modal view of all tag definitions
-3. **TagsEditor enhancement:** Show definition when selecting tags
-4. **Admin UI (optional):** Allow editing manual tag definitions
-
----
-
-## 6. Design Recommendations
-
-### Minimal Viable Glossary
-
-1. Add `description` field to manual tag JSON (backward compatible)
-2. Add `description` property to `ComputedTagPlugin` base class
-3. Create `GET /v1/tags/glossary` endpoint merging both sources
-4. Add tooltips to `TagChip` component showing definitions
-
-### Full Glossary Feature
-
-1. All above plus:
-2. Dedicated glossary page/modal in UI
-3. Admin interface for managing definitions
-4. Consider storing definitions in database for runtime updates
-
----
-
-## References
-
-- [Computed Tags Design](docs/computed-tags-design.md) - Full architecture
-- [Manual Tags Design](docs/manual-tags-design.md) - Provider pattern and validation
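
The glossary merge described in the notes above (manual tag definitions plus computed tag descriptions) might look roughly like this. It is a sketch under stated assumptions: `build_glossary`, the `group:value` key format for manual tags, and the entry fields `key`, `type`, and `description` are all hypothetical, and computed descriptions are passed in as a plain mapping rather than read from plugin instances.

```python
from typing import Any


def build_glossary(
    manual_groups: list[dict[str, Any]],
    computed: dict[str, str],
) -> list[dict[str, str]]:
    """Flatten manual tag definitions and computed tag descriptions into one list."""
    entries: list[dict[str, str]] = []
    for group in manual_groups:
        for tag in group.get("tags", []):
            # Accept both the current shape (plain string) and the proposed
            # shape ({"value": ..., "description": ...}) so old configs still load.
            if isinstance(tag, str):
                value, description = tag, ""
            else:
                value, description = tag["value"], tag.get("description", "")
            entries.append(
                {
                    "key": f"{group['group']}:{value}",  # assumed key format
                    "type": "manual",
                    "description": description,
                }
            )
    # Computed tags are listed after manual ones, sorted by tag key.
    for tag_key, description in sorted(computed.items()):
        entries.append({"key": tag_key, "type": "computed", "description": description})
    return entries
```

Accepting plain strings alongside the richer objects is one way to keep the JSON change backward compatible, as the minimal viable glossary calls for.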
diff --git a/.copilot-tracking/subagent/20260122/validation-error-clarity-research.md b/.copilot-tracking/subagent/20260122/validation-error-clarity-research.md
deleted file mode 100644
index f6e5b65..0000000
--- a/.copilot-tracking/subagent/20260122/validation-error-clarity-research.md
+++ /dev/null
@@ -1,257 +0,0 @@
-# Validation Error Clarity Research
-
-**Date:** 2026-01-22  
-**Jira Reference:** SA-334 "Key Paragraph too large for generation error is not clear to the user"
-
----
-
-## Executive Summary
-
-The validation error clarity system has **significant gaps**. The 2000-character limit for key paragraphs is enforced **only in the frontend UI** (character counter display) but **not in the backend validation**. When errors occur, the frontend displays generic messages because `mapApiErrorToMessage()` extracts only the `detail` or `message` field from API errors without semantic mapping to user-friendly guidance.
-
----
-
-## Research Questions & Findings
-
-### 1. What is the key paragraph validation in the backend (2000 char limit)?
-
-**Finding: The 2000-character limit is NOT enforced in the backend.**
-
-- Backend `Reference` model at [backend/app/domain/models.py](backend/app/domain/models.py#L12-L24):
-  ```python
-  class Reference(BaseModel):
-      url: str = Field(description="Reference URL (required, non-empty)")
-      title: str | None = Field(default=None)
-      content: str | None = None
-      keyExcerpt: str | None = None  # <-- No max_length validation
-      type: str | None = None
-      bonus: bool = False
-      messageIndex: Optional[int] = None
-  ```
-  
-- The `keyExcerpt` field (maps to `keyParagraph` in frontend) has **no length constraints** defined.
-
-- The 2000-character limit exists **only in the frontend UI display** at [frontend/src/components/app/editor/TurnReferencesModal.tsx](frontend/src/components/app/editor/TurnReferencesModal.tsx#L341):
-  ```tsx
-  <span className={cn("rounded-full px-2 py-0.5", ...)}>
-    {len}/40 (2000 max)
-  </span>
-  ```
-  
-- This is purely informational - **no validation prevents submission of longer text**.
-
-### 2. How does the backend return validation errors?
-
-**Finding: Generic HTTPException pattern with `detail` field.**
-
-The backend uses FastAPI's `HTTPException` with a `detail` parameter:
-
-- Example from [backend/app/api/v1/ground_truths.py](backend/app/api/v1/ground_truths.py#L234-L236):
-  ```python
-  raise HTTPException(
-      status_code=400,
-      detail=f"Tag '{tag[:50]}...' exceeds maximum length of {MAX_TAG_LENGTH} characters.",
-  )
-  ```
-
-- Validation errors return HTTP 422 with `HTTPValidationError` schema containing:
-  ```json
-  {
-    "detail": [
-      {
-        "type": "string",
-        "loc": ["body", "field_name"],
-        "msg": "validation error message",
-        "input": "..."
-      }
-    ]
-  }
-  ```
-
-- Chat endpoint uses safe error messages at [backend/app/api/v1/chat.py](backend/app/api/v1/chat.py#L20-L24):
-  ```python
-  SAFE_ERROR_MESSAGES = {
-      "invalid_input": "Invalid request format",
-      "service_unavailable": "Service temporarily unavailable",
-      "processing_error": "Unable to process request",
-  }
-  ```
-
-### 3. How does the frontend currently display validation errors?
-
-**Finding: Generic error display with minimal user guidance.**
-
-- Error mapping utility at [frontend/src/services/http.ts](frontend/src/services/http.ts#L26-L36):
-  ```typescript
-  export function mapApiErrorToMessage(err: unknown): string {
-    const e = err as Partial<ApiError & { data?: Record<string, unknown> }>;
-    if (e && typeof e === "object" && typeof e.status === "number") {
-      const data = e.data as Record<string, unknown> | undefined;
-      const detail =
-        (typeof data?.detail === "string" && data.detail) ||
-        (typeof data?.message === "string" && data.message) ||
-        "";
-      return `${e.status} ${e.statusText ?? "Error"}${detail ? ` – ${detail}` : ""}`;
-    }
-    return "Network or unexpected error";
-  }
-  ```
-
-- The `save()` function in [frontend/src/hooks/useGroundTruth.ts](frontend/src/hooks/useGroundTruth.ts#L248-L252) returns errors as-is:
-  ```typescript
-  } catch (e) {
-    const msg = e instanceof Error ? e.message : String(e);
-    return { ok: false, error: msg };
-  }
-  ```
-
-- **No error transformation** maps technical errors to user-friendly messages with remediation guidance.
-
-### 4. What error message mapping/transformation exists?
-
-**Finding: No semantic error mapping exists in either layer.**
-
-- Frontend does **not** have an error code registry or mapping table.
-- Backend uses `detail` strings directly without error codes.
-- The only pattern observed is `TagsModal.tsx` which has local `validationError` state for immediate UI feedback, but this doesn't apply to save operations.
-
-### 5. Where is key paragraph handled in the UI?
-
-**Locations identified:**
-
-| Component | File | Lines | Purpose |
-|-----------|------|-------|---------|
-| `TurnReferencesModal` | [frontend/src/components/app/editor/TurnReferencesModal.tsx](frontend/src/components/app/editor/TurnReferencesModal.tsx#L304-L341) | 304-341 | Primary key paragraph editor with character counter |
-| `useGroundTruth` | [frontend/src/hooks/useGroundTruth.ts](frontend/src/hooks/useGroundTruth.ts#L122) | 122, 154 | Trims keyParagraph in reference mapping |
-| API mapper | [frontend/src/adapters/apiMapper.ts](frontend/src/adapters/apiMapper.ts#L34) | 34, 64, 117, 127, 155 | Maps between `keyParagraph` (frontend) and `keyExcerpt` (backend) |
-
-**Key UI behavior:**
-- Character counter shows `{len}/40 (2000 max)` but this is **advisory only**
-- Textarea has no `maxLength` attribute
-- No client-side validation before submission
-
----
-
-## Gap Analysis
-
-### Current State vs Desired State
-
-| Aspect | Current State | Desired State |
-|--------|--------------|---------------|
-| Backend validation | None for keyExcerpt length | 2000 char limit enforced |
-| Error format | Generic `detail` string | Structured error with code, field, and remediation |
-| Frontend mapping | Pass-through display | Semantic mapping to user-friendly messages |
-| UI feedback | Post-submission error | Real-time validation + clear guidance |
-
-### Root Causes of SA-334
-
-1. **Missing backend validation**: The 2000-character limit mentioned in SA-334 doesn't exist in backend code
-2. **Generic error handling**: `mapApiErrorToMessage()` produces messages like `"400 Bad Request – Invalid request format"` without context
-3. **No error code system**: Cannot map backend errors to specific UI guidance
-4. **Frontend-only limit display**: The `(2000 max)` indicator suggests a limit that isn't enforced
-
----
-
-## Recommendations
-
-### Immediate (SA-334 Fix)
-
-1. **Add backend validation** for `keyExcerpt`:
-   ```python
-   # backend/app/domain/models.py
-   keyExcerpt: str | None = Field(default=None, max_length=2000)
-   ```
-
-2. **Add frontend validation** in `TurnReferencesModal.tsx`:
-   ```tsx
-   // Add maxLength to textarea
-   <textarea
-     maxLength={2000}
-     // existing props...
-   />
-   ```
-
-3. **Display specific error** when limit exceeded:
-   ```tsx
-   {len > 2000 && (
-     <span className="text-red-600">
-       Key paragraph exceeds 2000 character limit
-     </span>
-   )}
-   ```
-
-### Long-term (Validation Error Clarity System)
-
-1. **Create error code registry** in backend with structured errors:
-   ```python
-   class ValidationErrorCode(str, Enum):
-       KEY_PARAGRAPH_TOO_LONG = "KEY_PARAGRAPH_TOO_LONG"
-       TAG_EXCEEDS_LENGTH = "TAG_EXCEEDS_LENGTH"
-       # ...
-   ```
-
-2. **Build frontend error mapper** that translates codes to guidance:
-   ```typescript
-   const ERROR_MESSAGES: Record<string, ErrorGuidance> = {
-     KEY_PARAGRAPH_TOO_LONG: {
-       title: "Key paragraph too long",
-       message: "Shorten to under 2000 characters",
-       field: "keyParagraph"
-     }
-   };
-   ```
-
-3. **Add real-time validation** with character counter styling:
-   - Green when under limit
-   - Yellow when approaching (e.g., 1800+)
-   - Red when exceeded
-
----
-
-## Files for Modification
-
-| Priority | File | Change |
-|----------|------|--------|
-| High | [backend/app/domain/models.py](backend/app/domain/models.py) | Add `max_length=2000` to `keyExcerpt` |
-| High | [frontend/src/components/app/editor/TurnReferencesModal.tsx](frontend/src/components/app/editor/TurnReferencesModal.tsx) | Add maxLength, validation styling |
-| Medium | [frontend/src/services/http.ts](frontend/src/services/http.ts) | Create error mapping system |
-| Medium | [backend/app/api/v1/ground_truths.py](backend/app/api/v1/ground_truths.py) | Return structured validation errors |
-| Low | [frontend/src/hooks/useGroundTruth.ts](frontend/src/hooks/useGroundTruth.ts) | Surface validation errors per-field |
-
----
-
-## Appendix: Code References
-
-### Backend HTTPValidationError Schema
-
-From [frontend/src/api/openapi.json](frontend/src/api/openapi.json#L132-L136):
-```json
-{
-  "description": "Validation Error",
-  "content": {
-    "application/json": {
-      "schema": {
-        "$ref": "#/components/schemas/HTTPValidationError"
-      }
-    }
-  }
-}
-```
-
-### Frontend Reference Model Mapping
-
-From [frontend/src/services/groundTruths.ts](frontend/src/services/groundTruths.ts#L76):
-```typescript
-keyParagraph: r.keyExcerpt ?? undefined,
-```
-
-The field is called `keyParagraph` in frontend models and `keyExcerpt` in the backend/API schema.
-
-### Configuration Flag
-
-From [backend/app/core/config.py](backend/app/core/config.py#L89):
-```python
-REQUIRE_KEY_PARAGRAPH: bool = False  # Require key paragraphs for relevant references
-```
-
-This flag controls whether key paragraphs are required for approval, but doesn't enforce length limits.
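
The structured error proposed in the notes above (code, field, and remediation message) can be sketched with the standard library alone. Everything here is illustrative: `StructuredError`, `check_key_excerpt`, and the payload keys are assumed names, and a real implementation would more likely hook into the existing Pydantic model and FastAPI error handling.

```python
from dataclasses import asdict, dataclass
from enum import Enum
from typing import Optional

MAX_KEY_EXCERPT_LENGTH = 2000  # the limit the UI counter advertises


class ValidationErrorCode(str, Enum):
    KEY_PARAGRAPH_TOO_LONG = "KEY_PARAGRAPH_TOO_LONG"


@dataclass
class StructuredError:
    """Hypothetical error payload: machine-readable code plus user guidance."""

    code: str
    field: str
    message: str


def check_key_excerpt(text: Optional[str]) -> Optional[dict]:
    """Return a structured error payload, or None when the excerpt is acceptable."""
    if text is None or len(text) <= MAX_KEY_EXCERPT_LENGTH:
        return None
    err = StructuredError(
        code=ValidationErrorCode.KEY_PARAGRAPH_TOO_LONG.value,
        field="keyExcerpt",
        message=(
            f"Key paragraph is {len(text)} characters; "
            f"the limit is {MAX_KEY_EXCERPT_LENGTH}."
        ),
    )
    return asdict(err)
```

A stable `code` lets the frontend choose its own wording, while `field` lets it attach the message to the right input.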
diff --git a/.copilot-tracking/subagent/20260122/xss-sanitization-research.md b/.copilot-tracking/subagent/20260122/xss-sanitization-research.md
deleted file mode 100644
index 26623a5..0000000
--- a/.copilot-tracking/subagent/20260122/xss-sanitization-research.md
+++ /dev/null
@@ -1,195 +0,0 @@
-# XSS Sanitization Research - SA-565
-
-**Date:** 2026-01-22  
-**Component:** TurnReferencesModal.tsx and related components  
-**Story:** SA-565  
-
-## Executive Summary
-
-The frontend codebase **does NOT use `dangerouslySetInnerHTML`**, which is the primary XSS attack vector in React. The "key paragraph" fields in `TurnReferencesModal.tsx` are rendered via controlled `<textarea>` elements, which are safe by design. However, there are other user-generated content patterns that warrant review.
-
-## Question 1: Vulnerable Code Location in TurnReferencesModal.tsx
-
-### Location
-- File: [frontend/src/components/app/editor/TurnReferencesModal.tsx](frontend/src/components/app/editor/TurnReferencesModal.tsx#L320-L370)
-
-### Key Paragraph Rendering (Lines 320-370)
-
-The key paragraph section uses a controlled `<textarea>`:
-
-```tsx
-<textarea
-    className={cn(...)}
-    placeholder={readOnly ? "" : "Summarize the most relevant..."}
-    value={r.keyParagraph || ""}
-    onChange={
-        readOnly
-            ? undefined
-            : (e) => onUpdateReference(r.id, { keyParagraph: e.target.value })
-    }
-    readOnly={readOnly}
-    rows={...}
-/>
-```
-
-**Assessment:** This is **SAFE**. React `<textarea>` with `value` prop escapes all content automatically. There is no XSS vulnerability here.
-
-### Reference Title/URL Rendering (Lines 244-268)
-
-```tsx
-<div className="break-words text-sm font-medium">
-    [{index + 1}] {r.title || urlToTitle(r.url)}
-</div>
-<a
-    className="inline-flex max-w-full items-center gap-1 truncate text-xs text-violet-700 underline"
-    onClick={(e) => { e.preventDefault(); onOpenReference(r); }}
-    href={normalizeUrl(r.url)}
-    target="_blank"
-    rel="noreferrer"
->
-    <ExternalLink className="h-3.5 w-3.5" /> {normalizeUrl(r.url)}
-</a>
-```
-
-**Assessment:** **SAFE** - React's JSX automatically escapes text content in curly braces.
-
-## Question 2: Existing Sanitization Libraries
-
-### package.json Dependencies
-
-```json
-{
-    "dependencies": {
-        "react-markdown": "^9.0.3",
-        "remark-gfm": "^4.0.0"
-    }
-}
-```
-
-### Findings
-
-| Library | Purpose | XSS Protection |
-|---------|---------|----------------|
-| `react-markdown` | Markdown rendering | ✅ Raw HTML in Markdown is not rendered by default |
-| `remark-gfm` | GitHub Flavored Markdown | Plugin only, inherits react-markdown's safety |
-
-**No explicit sanitization libraries** like `DOMPurify` or `xss` are installed. The codebase relies on:
-1. React's automatic escaping of JSX expressions
-2. react-markdown's built-in sanitization
-
-## Question 3: Components Rendering User-Generated Content
-
-### Components Analyzed
-
-| Component | User Content Rendered | Method | Risk Level |
-|-----------|----------------------|--------|------------|
-| [TurnReferencesModal.tsx](frontend/src/components/app/editor/TurnReferencesModal.tsx) | `keyParagraph`, `title`, `url` | `<textarea>`, JSX interpolation | ✅ Low |
-| [SelectedTab.tsx](frontend/src/components/app/ReferencesPanel/SelectedTab.tsx) | `keyParagraph`, `title`, `url` | `<textarea>`, JSX interpolation | ✅ Low |
-| [ConversationTurn.tsx](frontend/src/components/app/editor/ConversationTurn.tsx) | `turn.content` | `MarkdownRenderer` component | ✅ Low |
-| [MarkdownRenderer.tsx](frontend/src/components/common/MarkdownRenderer.tsx) | Markdown content | `ReactMarkdown` | ✅ Low |
-| [InspectItemModal.tsx](frontend/src/components/modals/InspectItemModal.tsx) | Item data, references | JSX, `MultiTurnEditor` | ✅ Low |
-
-### URL Handling in InspectItemModal.tsx
-
-Found a comprehensive URL validation utility at [InspectItemModal.tsx#L28-L54](frontend/src/components/modals/InspectItemModal.tsx#L28-L54):
-
-```tsx
-const validateReferenceUrl = (url: string): boolean => {
-    try {
-        const parsedUrl = new URL(url);
-        
-        // Only allow safe protocols
-        const allowedProtocols = ["http:", "https:"];
-        if (!allowedProtocols.includes(parsedUrl.protocol)) {
-            return false;
-        }
-        
-        // Block known malicious patterns
-        const maliciousPatterns = [
-            /javascript:/i, /data:/i, /vbscript:/i, /about:/i, /blob:/i
-        ];
-        
-        if (maliciousPatterns.some(pattern => pattern.test(url))) {
-            return false;
-        }
-        return true;
-    } catch (_error) {
-        return false;
-    }
-};
-```
-
-**This validation is only used in `InspectItemModal`**, not in `TurnReferencesModal`.
-
-## Question 4: React Best Practices for User Content
-
-### Current State
-
-React provides automatic XSS protection through:
-1. **JSX Expression Escaping:** All `{value}` expressions are automatically escaped
-2. **No `dangerouslySetInnerHTML`:** Confirmed - zero instances in the codebase
-3. **react-markdown:** Uses allowlist approach, disables raw HTML by default
-
-### Best Practice Recommendations
-
-1. **DOMPurify** - Only needed if using `dangerouslySetInnerHTML` (not applicable here)
-2. **URL Validation** - The `validateReferenceUrl` pattern in `InspectItemModal` should be applied consistently to all reference URL opening
-
-## Vulnerable Patterns Found
-
-### Pattern 1: Inconsistent URL Validation
-
-**Issue:** The URL validation in `InspectItemModal` is not applied in `TurnReferencesModal.tsx`.
-
-**Location:** [TurnReferencesModal.tsx#L262-L270](frontend/src/components/app/editor/TurnReferencesModal.tsx#L262-L270)
-
-```tsx
-<a
-    onClick={(e) => { e.preventDefault(); onOpenReference(r); }}
-    href={normalizeUrl(r.url)}
-    target="_blank"
-    rel="noreferrer"
->
-```
-
-**Risk:** The `onOpenReference` callback may open malicious URLs if the parent component doesn't validate.
-
-### Pattern 2: Missing `noopener` on External Links
-
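-**Issue:** `rel="noreferrer"` already implies `noopener` in browsers that follow the current HTML specification, but listing both explicitly (`rel="noopener noreferrer"`, space-separated) is the more defensive convention.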
-
-**Locations:**
-- [TurnReferencesModal.tsx#L268](frontend/src/components/app/editor/TurnReferencesModal.tsx#L268)
-- [SelectedTab.tsx#L64](frontend/src/components/app/ReferencesPanel/SelectedTab.tsx#L64)
-
-## Existing Mitigations
-
-1. **No `dangerouslySetInnerHTML`** - Primary XSS vector is absent
-2. **react-markdown sanitization** - Markdown content is sanitized
-3. **URL validation in InspectItemModal** - Partial protection for that component
-4. **External link confirmation** in InspectItemModal for untrusted domains
-
-## Recommendations
-
-### High Priority
-1. Extract `validateReferenceUrl` to a shared utility and use it consistently in:
-   - `TurnReferencesModal.tsx` `onOpenReference` handler
-   - `SelectedTab.tsx` `onOpenReference` handler
-   - Any component that opens reference URLs
-
-### Medium Priority
-2. Add `noopener` to all external link `rel` attributes
-3. Consider adding domain allowlisting for reference URLs at the application level
-
-### Low Priority
-4. No need to add DOMPurify unless `dangerouslySetInnerHTML` is introduced in the future
-
-## Conclusion
-
-**SA-565's concern about XSS in key paragraph rendering is a false positive.** The `<textarea>` element with React's controlled component pattern is inherently safe against XSS.
-
-However, the research uncovered an **inconsistent URL validation pattern** that should be addressed:
-- `InspectItemModal` has robust URL validation
-- `TurnReferencesModal` and `SelectedTab` do not validate URLs before opening
-
-The real vulnerability is **not XSS injection via content** but **potential for malicious URL schemes** if a compromised backend sends `javascript:` or `data:` URLs in reference data.
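
Because the notes above locate the risk in reference data carrying `javascript:` or `data:` URLs, the same scheme allowlist could also be applied before such data is stored or returned. The following is a sketch of that idea only, not a port of `validateReferenceUrl`: the name `is_safe_reference_url` and its placement on the server side are assumptions.

```python
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}


def is_safe_reference_url(url: str) -> bool:
    """Accept only absolute http(s) URLs that name a host."""
    try:
        parts = urlsplit(url.strip())
    except ValueError:
        # urlsplit raises ValueError for some malformed inputs, e.g. bad IPv6 hosts.
        return False
    # An allowlist on the parsed scheme makes a separate denylist of
    # javascript:, data:, and similar schemes unnecessary.
    return parts.scheme.lower() in ALLOWED_SCHEMES and bool(parts.netloc)
```

Checking the parsed scheme rather than pattern-matching the raw string avoids rejecting legitimate URLs that merely contain text such as `data:` in a path or query.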
diff --git a/.github/prompts/init-agent-native.prompt.md b/.github/prompts/init-agent-native.prompt.md
new file mode 100644
index 0000000..487efa8
--- /dev/null
+++ b/.github/prompts/init-agent-native.prompt.md
@@ -0,0 +1,83 @@
+# Init Agent-Native
+
+> **Purpose:** Make this repository agent-native — structured so any AI coding agent can work in it effectively.
+> **Usage:** Run `/init-agent-native` in Copilot Chat, or feed this file to any agent CLI.
+
+---
+
+## Prompt
+
+You are setting up agent-native infrastructure for this repository. This gives AI agents clear instructions, observable execution, verifiable quality gates, and architectural guardrails.
+
+### Step 1: Bootstrap
+
+Run the bootstrap script from the agent-native skill:
+
+```bash
+bash .agents/skills/agent-native/scripts/bootstrap_harness.sh .
+```
+
+Read the output carefully — it tells you what to do next, in order.
+
+### Step 2: Build or identify the app
+
+If there's already an app in this repo, skip to Step 3.
+
+If you're also building the app, build it now. The bootstrap output assumes an app exists when you customize the harness.
+
+### Step 3: Follow the bootstrap instructions
+
+The bootstrap printed numbered steps. Follow them in order:
+
+1. **Read AGENTS.md** — it tells you what to customize and what conventions to follow
+2. **Read docs/OBSERVABILITY.md** — follow the Level Policy (2xx=INFO, 4xx=WARN, 5xx=ERROR) and implement hlog()/htrace() per the language examples
+3. **Customize docs/ARCHITECTURE.md** — replace ALL placeholders with real project info, add at least one lint rule for boundary enforcement
+4. **Fill in scripts/harness/*.sh** — the auto-detect should work for most projects, but verify each script runs successfully
+5. **Add observability to the app** — hlog() for structured logs, htrace() for request traces (see language examples in OBSERVABILITY.md)
+6. **Verify CI passes:** `make -f Makefile.harness ci`
+7. **Verify customization:** `scripts/verify_customized.sh .`
+8. **Run audit:** `scripts/audit_harness.sh .`
+
+### Step 4: Verify observability end-to-end
+
+Start the app, make a few requests including one that returns a 404, then check:
+
+```bash
+# Should see WARN entries for 4xx, ERROR for 5xx
+jq 'select(.level == "WARN" or .level == "ERROR")' .harness/logs.jsonl
+
+# Should see trace entries with duration_ms
+jq 'select(.duration_ms > 0)' .harness/traces.jsonl
+```
+
+### Step 5: Final check
+
+Run all three verification commands. All must pass:
+
+```bash
+make -f Makefile.harness ci
+scripts/verify_customized.sh .
+scripts/audit_harness.sh .
+```
+
+### What you're installing
+
+| Artifact | Purpose |
+|---|---|
+| `AGENTS.md` | Agent instructions — commands, constraints, conventions |
+| `docs/ARCHITECTURE.md` | Module boundaries + lint rules to enforce them |
+| `docs/OBSERVABILITY.md` | Structured logging convention (JSONL, level policy) |
+| `Makefile.harness` | Stable command surface: `make smoke`, `make check`, `make ci` |
+| `scripts/harness/` | Real scripts behind the Makefile (smoke, test, lint, typecheck) |
+| `.harness/` | Runtime observability data (logs.jsonl, traces.jsonl) |
+| `scripts/verify_customized.sh` | Catches leftover template boilerplate |
+| `scripts/audit_harness.sh` | Checks for structural gaps |
+
+### Key conventions
+
+- **Level Policy:** 2xx → INFO, 4xx → WARN, 5xx/exceptions → ERROR
+- **Two output files:** `.harness/logs.jsonl` (structured logs) and `.harness/traces.jsonl` (request traces with duration_ms)
+- **Required log fields:** `ts`, `level`, `msg`, `service`
+- **Required trace fields:** `trace_id`, `span_id`, `name`, `service`, `start`, `end`, `duration_ms`, `status`
+- **Minimum tests:** At least 5 meaningful tests covering core operations
+- **smoke.sh must:** start the server → poll health → make a request → kill the server
diff --git a/.github/prompts/init-agents-md.prompt.md b/.github/prompts/init-agents-md.prompt.md
new file mode 100644
index 0000000..cd094be
--- /dev/null
+++ b/.github/prompts/init-agents-md.prompt.md
@@ -0,0 +1,160 @@
+# Init AGENTS.md
+
+> **Purpose:** Generate a high-quality AGENTS.md for this repository.
+> **Usage:** Feed this prompt to any AI coding agent while in the repo root.
+> **Why this matters:** Vercel's evals showed a small, always-present AGENTS.md "docs index" hit 100% on Next.js API tasks — beating both no-docs (53%) and on-demand skills (53-79%). Passive context beats active retrieval because agents don't have to decide to look things up.
+
+---
+
+## Prompt
+
+You are initializing an AGENTS.md file for this repository. AGENTS.md is **persistent context** — it's loaded every time an agent works in this repo. Think of it as a **routing table for agent attention**: what to run, where truth lives, how to find things fast, and what not to do.
+
+### Step 1: Check for existing AGENTS.md
+
+```bash
+cat AGENTS.md 2>/dev/null
+```
+
+If an AGENTS.md already exists, you are **updating**, not creating from scratch:
+- Preserve any manually-added sections, rules, or gotchas
+- Update commands, project map, and docs index to reflect the current repo state
+- Remove references to files/directories that no longer exist
+- Add new docs, directories, or commands that have appeared since the last update
+- Keep the file under 8KB
+
+If no AGENTS.md exists, you are creating one fresh.
+
+### Step 2: Explore the repo
+
+Before writing anything, explore the project:
+
+```bash
+# What's here?
+find . -maxdepth 3 -type f | head -80
+cat README.md 2>/dev/null
+cat package.json 2>/dev/null || cat pyproject.toml 2>/dev/null || cat go.mod 2>/dev/null || cat *.csproj 2>/dev/null || cat Cargo.toml 2>/dev/null
+
+# How do you build/test/lint?
+cat Makefile 2>/dev/null || cat Makefile.harness 2>/dev/null
+cat .github/workflows/*.yml 2>/dev/null | head -100
+
+# Existing agent config?
+cat AGENTS.md 2>/dev/null
+cat .github/copilot-instructions.md 2>/dev/null
+cat .cursorrules 2>/dev/null
+cat .claude/settings.json 2>/dev/null
+
+# Docs?
+ls docs/ 2>/dev/null
+ls .next-docs/ 2>/dev/null
+ls ADR/ 2>/dev/null || ls docs/adr/ 2>/dev/null
+```
+
+### Step 3: Write or update AGENTS.md
+
+Write (or update) AGENTS.md in the repo root with these sections. **Keep it under 8KB.** Every line must earn its place — agents read this on every turn.
+
+#### Required Sections
+
+**1. Setup & Commands** (copy-pasteable, no prose)
+
+```markdown
+## Commands
+
+| Goal | Command |
+|---|---|
+| Install deps | `<command>` |
+| Dev server | `<command>` |
+| Lint | `<command>` |
+| Type check | `<command>` |
+| Test (all) | `<command>` |
+| Test (single) | `<command> <path>` |
+| Build | `<command>` |
+| CI-equivalent | `<command>` |
+```
+
+**2. Project Map** (what matters, where — max 15 lines)
+
+```markdown
+## Project Map
+
+- `src/` — application source
+  - `src/api/` — route handlers
+  - `src/lib/` — shared utilities
+  - `src/db/` — database layer (DO NOT import from api/)
+- `tests/` — test suite (mirrors src/ structure)
+- `docs/` — architecture decisions and API docs
+- `scripts/` — build and deployment scripts
+```
+
+Only list directories an agent would actually need to navigate. Skip obvious ones (node_modules, dist, .git).
+
+**3. Decision Rules** (when unsure, what to do)
+
+```markdown
+## Rules
+
+- Prefer retrieval-led reasoning over pre-training knowledge. When unsure about an API or pattern, check the docs index below before guessing.
+- Run `<lint command>` before committing. Treat lint failures as blocking.
+- Run `<test command>` after any logic change.
+- Do not modify files in `<protected paths>` without asking.
+- Keep modules within their boundaries (see Architecture below).
+```
+
+**4. Docs Index** (pointers, not walls of text)
+
+This is the key insight from Vercel's research: a compressed index mapping topics → files beats pasting docs inline. If the repo has local docs, ADRs, or a docs folder, build an index:
+
+```markdown
+## Docs Index
+
+When you need information on a topic, open the referenced file:
+
+| Topic | File |
+|---|---|
+| API route conventions | `docs/api-routes.md` |
+| Database migrations | `docs/migrations.md` |
+| Auth flow | `docs/auth.md` |
+| Error handling | `docs/OBSERVABILITY.md` |
+| Module boundaries | `docs/ARCHITECTURE.md` |
+| Deployment | `docs/deploy.md` |
+| ADR: chose Postgres over Mongo | `docs/adr/001-database.md` |
+```
+
+If the repo has framework docs locally (e.g., `.next-docs/`, `vendor/docs/`), index those too. Point to specific files, not directories.
+
+**5. Quality Bar** (what "done" means)
+
+```markdown
+## Quality Bar
+
+- All tests pass
+- No lint errors
+- No type errors
+- New endpoints include tests
+- Structured logging follows `docs/OBSERVABILITY.md` convention
+- PR description explains the "why"
+```
+
+#### Optional Sections (include if relevant)
+
+- **Architecture Boundaries** — if the project has layer rules (e.g., "store must not import from handlers"), state them explicitly
+- **Conventions** — naming patterns, file organization rules, import ordering
+- **Known Gotchas** — things that break in non-obvious ways
+
+### Step 4: Verify
+
+After writing AGENTS.md:
+
+1. Confirm it's under 8KB: `wc -c AGENTS.md`
+2. Every command listed actually works (run them)
+3. Every file referenced in the docs index actually exists
+4. No placeholder text remains
+
+### Design Principles (why these rules)
+
+- **Passive > Active**: Agents read AGENTS.md every turn. They have to *decide* to invoke skills/tools. In Vercel's evals, the agent never invoked the skill in 56% of cases.
+- **Pointers > Content**: An 8KB index that says "open this file for auth docs" beats pasting 40KB of auth docs inline. The agent fetches what it needs, when it needs it.
+- **Commands > Descriptions**: `npm run test -- --watch` is better than "you can run the tests in watch mode using the npm test script with the watch flag."
+- **Guardrails > Guidance**: "DO NOT import from api/ in the db layer" is better than "try to keep layers separate."
diff --git a/.github/prompts/update-agent-native-docs.prompt.md b/.github/prompts/update-agent-native-docs.prompt.md
new file mode 100644
index 0000000..3641456
--- /dev/null
+++ b/.github/prompts/update-agent-native-docs.prompt.md
@@ -0,0 +1,108 @@
+# Update Agent-Native Docs
+
+> **Purpose:** Sync agent-native documentation with the current state of the codebase.
+> **Usage:** Run `/update-agent-native-docs` in Copilot Chat periodically, after major refactors, or when docs feel stale.
+
+---
+
+## Prompt
+
+You are auditing and updating the agent-native documentation in this repository. These docs drift as code evolves — new modules get added, boundaries shift, commands change, and files get renamed. Your job is to make the docs match reality.
+
+### Step 1: Read current docs
+
+```bash
+cat AGENTS.md 2>/dev/null
+cat docs/ARCHITECTURE.md 2>/dev/null
+cat docs/OBSERVABILITY.md 2>/dev/null
+cat PLANS.md 2>/dev/null
+```
+
+### Step 2: Scan the actual codebase
+
+```bash
+# Current structure
+find . -maxdepth 3 -type f -not -path '*/node_modules/*' -not -path '*/.git/*' -not -path '*/dist/*' -not -path '*/obj/*' -not -path '*/bin/*' | sort
+
+# Current commands
+cat Makefile 2>/dev/null || cat Makefile.harness 2>/dev/null
+cat package.json 2>/dev/null | jq '.scripts' 2>/dev/null
+cat pyproject.toml 2>/dev/null | grep -A20 '\[tool\.' 2>/dev/null
+
+# Current imports / module dependencies (spot-check boundaries)
+# TypeScript:
+grep -r "from ['\"]" src/ --include="*.ts" 2>/dev/null | head -30
+# Python:
+grep -r "^from \|^import " src/ app/ --include="*.py" 2>/dev/null | head -30
+# Go:
+grep -r '"' --include="*.go" . 2>/dev/null | grep -v test | grep import | head -20
+# C#:
+grep -r "^using " --include="*.cs" . 2>/dev/null | grep -v obj | head -20
+
+# Existing docs referenced in AGENTS.md
+grep -oE '`[^`]+\.(md|txt|yml|yaml)`' AGENTS.md 2>/dev/null | sort -u | while read f; do
+  f=$(echo "$f" | tr -d '`')
+  [ -f "$f" ] && echo "✅ $f" || echo "❌ MISSING: $f"
+done
+```
+
+### Step 3: Update each doc
+
+For each file, diff what the doc says against what the code actually does. Fix discrepancies.
+
+#### AGENTS.md
+- **Commands table:** Do all commands still work? Run each one. Remove broken ones, add new ones.
+- **Project map:** Do all listed directories exist? Are there new important directories not listed?
+- **Docs index:** Do all referenced files exist? Are there new docs not indexed?
+- **Decision rules:** Still accurate? Any new rules needed from recent refactors?
+- **Quality bar:** Still matches CI/linting reality?
+
+#### docs/ARCHITECTURE.md
+- **Module list:** Does it match the actual directory structure?
+- **Execution flow:** Still accurate? Test by tracing a request through the code.
+- **Boundaries:** Are the stated rules actually followed? Spot-check with import grep above.
+- **Lint rules:** Do the configured lint rules still match the documented boundaries? Run them.
+- **Refactoring red flags:** Updated with any new patterns to watch for?
+
+#### docs/OBSERVABILITY.md
+- **hlog()/htrace() calls:** Still present in the codebase? Check with: `grep -r "hlog\|htrace\|HLog\|HTrace\|Harness" --include="*.ts" --include="*.py" --include="*.go" --include="*.cs" .`
+- **Level policy:** Still followed? Start the app, hit a 404, check `.harness/logs.jsonl`
+- **Field conventions:** Do actual log entries match the documented fields?
+- **Endpoints listed:** Do example endpoints in the doc still exist?
+
+#### PLANS.md
+- **Completed items:** Mark done items as done, remove if no longer relevant.
+- **Stale items:** Flag anything that hasn't been touched in the last iteration.
+
+### Step 4: Verify
+
+After updating:
+
+```bash
+# All commands in AGENTS.md still work
+if [ -f Makefile.harness ]; then make -f Makefile.harness ci; else make ci; fi
+
+# All referenced files exist
+grep -oE '`[^`]+\.(md|txt|yml|yaml)`' AGENTS.md 2>/dev/null | sort -u | while read f; do
+  f=$(echo "$f" | tr -d '`')
+  [ -f "$f" ] || echo "❌ MISSING: $f"
+done
+
+# Verify customization still passes
+scripts/verify_customized.sh . 2>/dev/null
+
+# AGENTS.md still under 8KB
+wc -c AGENTS.md
+```
+
+### Step 5: Summarize changes
+
+After updating, add a brief summary of what changed and why. This helps the next update cycle understand what drifted.
+
+### When to run this
+
+- After any major refactor (new modules, renamed directories, changed boundaries)
+- After adding new dependencies or removing old ones
+- After changing CI/build commands
+- Periodically (weekly or biweekly) as docs hygiene
+- When an agent reports confusion about project structure
diff --git a/.github/skills/agent-native/.gitignore b/.github/skills/agent-native/.gitignore
new file mode 100644
index 0000000..f31b3e2
--- /dev/null
+++ b/.github/skills/agent-native/.gitignore
@@ -0,0 +1,2 @@
+.DS_Store
+*.swp
diff --git a/.github/skills/agent-native/SKILL.md b/.github/skills/agent-native/SKILL.md
new file mode 100644
index 0000000..24dcca7
--- /dev/null
+++ b/.github/skills/agent-native/SKILL.md
@@ -0,0 +1,125 @@
+---
+name: harness-engineering-playbook
+description: Bootstrap any repository with agent-first harness engineering — deterministic commands, structured JSONL observability, compact docs, and strict boundaries. Use when setting up or improving a repo for coding agent workflows. Default observability is JSONL + jq (zero deps); devtel is the upgrade path.
+---
+
+# Harness Engineering Playbook
+
+Operationalize OpenAI's Harness Engineering practices in any repo so coding agents can run against it repeatedly and safely.
+
+## Quick Start
+
+```bash
+# Bootstrap a repo
+./scripts/bootstrap_harness.sh /path/to/repo
+
+# Or from within the repo
+./scripts/bootstrap_harness.sh .
+```
+
+This creates:
+- `AGENTS.md` — agent-facing docs (commands, constraints, guardrails, debugging loop)
+- `PLANS.md` — durable planning context for multi-step tasks
+- `docs/ARCHITECTURE.md` — module boundaries and data flow
+- `docs/OBSERVABILITY.md` — JSONL logging convention + jq query patterns
+- `Makefile.harness` — `make setup`, `make format`, `make smoke`, `make check`, `make ci`
+- `scripts/harness/` — deterministic wrappers (setup, format, smoke, test, lint, typecheck)
+- `.harness/` — JSONL telemetry files + jq query library (gitignored)
+- `.github/workflows/harness.yml` — CI integration
+
+## What To Load
+
+- `references/openai-harness-practices.md` — full practice-to-artifact mapping
+- `references/static-analysis.md` — linter recommendations per language + agent-friendly output patterns
+- `references/agent-hooks/` — lifecycle hooks for agent automation (format, lint, audit, safety)
+  - `copilot-hooks/` — GitHub Copilot (VS Code, CLI, coding agent) hook config + recipes
+  - _(add claude-code/, cursor/, etc. as needed)_
+- `references/browser-tools/` — browser automation and debugging for runtime verification
+  - `playwright-cli/` — CLI for navigate, interact, snapshot, screenshot (primary tool)
+  - `chrome-devtools-mcp/` — MCP server for deep debugging, perf profiling, network analysis
+- `references/rollout-checklist.md` — phased adoption for active repos
+- `assets/templates/` — all template files
+
+## The Nine Practices
+
+### 1. Make Hard Things Easy To Do
+One command for every high-value task: `make setup`, `make format`, `make smoke`, `make check`, `make ci`. No manual prep.
+
+### 2. Communicate Actionable Constraints With Compact Docs
+`AGENTS.md` — short, concrete, command-first. Not narrative prose.
+
+### 3. Structure Codebase With Strict Boundaries And Flow
+`docs/ARCHITECTURE.md` — clear module boundaries, typed contracts, parse at edges.
+
+### 4. Build Observability In From Day 1
+**Default: JSONL + jq** (zero dependencies, any language).
+
+```bash
+# Your app appends structured JSON to .harness/logs.jsonl:
+# {"ts":"...","level":"ERROR","msg":"timeout","service":"api","trace_id":"abc"}
+
+# Agent queries with jq:
+jq 'select(.level == "ERROR")' .harness/logs.jsonl
+jq --arg tid abc 'select(.trace_id == $tid)' .harness/logs.jsonl .harness/traces.jsonl
+
+# Pre-built queries in .harness/queries/:
+jq -f .harness/queries/errors.jq .harness/logs.jsonl
+jq --argjson threshold 500 -f .harness/queries/slow.jq .harness/traces.jsonl
+```
+
+**Upgrade to devtel** when you need joins, aggregations, auto-instrumentation, or 100K+ events:
+```bash
+npm install devtel && npx devtel init
+# import "devtel/init" in app entry point
+npx devtel logs --level error --last 5m
+```
+
+See `docs/OBSERVABILITY.md` for full logging convention, field names, and language examples.
+
+### 5. Optimize For Agent Flow, Not Human Flow
+`PLANS.md` for multi-step tasks. Front-load context so restarts are cheap.
+
+### 6. Bring Your Own Harness
+Repo-local wrappers in `scripts/harness/`. Same commands work locally and in CI.
+
+### 7. Prototype In Natural Language First
+Draft logic and tests in prose in `PLANS.md` before coding.
+
+### 8. Invest In Static Analysis And Linting
+`make check` (lint + typecheck) runs before `make test`. Fast-fail on static errors.
+
+See `references/static-analysis.md` for per-language tool recommendations (ESLint, Ruff, golangci-lint, Biome, MegaLinter for monorepos) and agent-friendly JSON output patterns.
+
+### 9. Manage Entropy
+`scripts/audit_harness.sh` catches docs drift, stale scripts, missing artifacts.
+
+## Workflow
+
+1. **Baseline** — inventory the repo's existing commands, CI, and pain points
+2. **Bootstrap** — `bootstrap_harness.sh` installs templates (won't overwrite existing files)
+3. **Read the output** — bootstrap prints next steps; follow them in order
+4. **Customize docs** — AGENTS.md, docs/ARCHITECTURE.md (replace ALL placeholders), docs/OBSERVABILITY.md
+5. **Install deps** — `make setup` auto-detects and installs dependencies (override with `HARNESS_SETUP_CMD`)
+6. **Fill in scripts** — `scripts/harness/*.sh` with real project commands (not stubs)
+7. **Add observability** — implement `hlog()` per the language example in docs/OBSERVABILITY.md; follow the Level Policy
+8. **Validate** — `make -f Makefile.harness ci` must pass; `scripts/verify_customized.sh .` catches leftover boilerplate; `scripts/audit_harness.sh .` checks for gaps
+9. **Iterate** — observe an agent run, patch gaps, re-audit
+
+## Agent Verify Loop
+
+After any code change, the agent should:
+
+```bash
+make ci                                          # smoke + lint + typecheck + test
+jq -c 'select(.level == "ERROR" or .level == "WARN")' .harness/logs.jsonl | tail -5   # check for runtime errors (last 5 entries)
+jq -c 'select(.duration_ms > 1000)' .harness/traces.jsonl | tail -5  # check for regressions (last 5 slow traces)
+```
+
+If errors or slow traces appear → fix and re-run. When clean → commit.
+
+## Adaptation Rules
+
+- Preserve existing project conventions; replace templates incrementally
+- Don't overwrite user-authored files without explicit approval
+- Keep command names stable; change internals behind wrappers
+- Favor deterministic, scriptable workflows over ad-hoc interactive steps
diff --git a/.github/skills/agent-native/agents/openai.yaml b/.github/skills/agent-native/agents/openai.yaml
new file mode 100644
index 0000000..2a2f605
--- /dev/null
+++ b/.github/skills/agent-native/agents/openai.yaml
@@ -0,0 +1,4 @@
+interface:
+  display_name: "Harness Engineering Playbook"
+  short_description: "OpenAI harness patterns for agent workflows"
+  default_prompt: "Analyze this repository and set up harness engineering workflows, docs, and automation following OpenAI Harness Engineering principles."
diff --git a/.github/skills/agent-native/assets/templates/.github/workflows/harness.yml b/.github/skills/agent-native/assets/templates/.github/workflows/harness.yml
new file mode 100644
index 0000000..5726595
--- /dev/null
+++ b/.github/skills/agent-native/assets/templates/.github/workflows/harness.yml
@@ -0,0 +1,22 @@
+name: Harness CI
+
+on:
+  push:
+    branches: [main]
+  pull_request:
+
+jobs:
+  harness:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v4
+
+      # Add language/runtime setup steps as needed for this repository.
+      # Examples:
+      # - actions/setup-node@v4
+      # - actions/setup-python@v5
+      # - dtolnay/rust-toolchain@stable
+
+      - name: Run harness pipeline
+        run: make ci
diff --git a/.github/skills/agent-native/assets/templates/.github/workflows/nightly-harness-audit.yml b/.github/skills/agent-native/assets/templates/.github/workflows/nightly-harness-audit.yml
new file mode 100644
index 0000000..6056e8d
--- /dev/null
+++ b/.github/skills/agent-native/assets/templates/.github/workflows/nightly-harness-audit.yml
@@ -0,0 +1,21 @@
+name: Nightly Harness Audit
+
+on:
+  schedule:
+    - cron: "0 4 * * *"
+  workflow_dispatch:
+
+jobs:
+  audit:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v4
+
+      # Add runtime setup steps required for this repository.
+
+      - name: Baseline harness audit
+        run: scripts/audit_harness.sh .
+
+      - name: Entropy check
+        run: scripts/harness/entropy_check.sh
diff --git a/.github/skills/agent-native/assets/templates/AGENTS.md b/.github/skills/agent-native/assets/templates/AGENTS.md
new file mode 100644
index 0000000..2ef07aa
--- /dev/null
+++ b/.github/skills/agent-native/assets/templates/AGENTS.md
@@ -0,0 +1,89 @@
+# AGENTS.md
+
+> **CUSTOMIZE THIS FILE** — replace all `<placeholder>` values with real project info.
+> Read `docs/OBSERVABILITY.md` and `docs/ARCHITECTURE.md` before writing code.
+
+## Project Overview
+
+- Project: `<project-name>`
+- Primary runtime(s): `<runtime>`
+- Main entrypoint(s): `<entrypoints>`
+
+## Harness Commands
+
+Run from repository root:
+
+| Goal | Command |
+|---|---|
+| Install dependencies | `make setup` |
+| Auto-format code | `make format` |
+| Fast sanity check | `make smoke` |
+| Static checks | `make check` |
+| Full test suite | `make test` |
+| CI-equivalent local run | `make ci` |
+
+## Debugging Loop
+
+When `make ci` fails, don't just re-run and hope. Read the error, fix the cause, then verify.
+
+1. **Identify which stage failed** — `make ci` runs `smoke → check (lint + typecheck) → test`. Check which one errored.
+
+2. **Lint fails** → run `make format` first (auto-fixes formatting), then re-run `make check`. If it still fails, read the linter output — it tells you exactly which file and rule.
+
+3. **Typecheck fails** → read the error output carefully. It gives you the file, line number, and expected types. Fix the specific issue; don't suppress the error.
+
+4. **Test fails** → check `.harness/logs.jsonl` for runtime errors (`jq 'select(.level == "ERROR")' .harness/logs.jsonl`). Read the test output for assertion details — which test, what was expected vs actual.
+
+5. **Smoke fails** → the app didn't start or respond. Common causes:
+   - Missing dependencies → run `make setup`
+   - Port conflict → check if something else is on the port
+   - Wrong start command → check `HARNESS_SMOKE_START_CMD` or `scripts/harness/smoke.sh`
+   - For non-server projects, set `HARNESS_SMOKE_MODE` to `cli`, `library`, or `auto`
+
+6. **General** — read the actual error message. Fix the root cause. Then run `make ci` again to confirm the fix didn't break something else.
+
+## Constraints And Guardrails
+
+- Prefer deterministic scripts over interactive/manual steps.
+- Keep command names stable (`setup`, `format`, `smoke`, `check`, `test`, `ci`).
+- Update docs and scripts in the same change when workflow behavior changes.
+- Avoid side effects outside the repo unless explicitly required.
+
+## Architecture Boundaries
+
+- Parse and validate external data at boundaries.
+- Keep internal data models typed and normalized.
+- Keep each module focused on one responsibility.
+- **Enforce boundaries with lint rules** (see `docs/ARCHITECTURE.md` for examples).
+- Customize `docs/ARCHITECTURE.md` with real module boundaries and execution flow — do not leave template boilerplate.
+
+## Observability Convention
+
+**Read `docs/OBSERVABILITY.md` for the full logging convention.**
+
+Key rules:
+- All structured logs write to `.harness/logs.jsonl`, traces to `.harness/traces.jsonl`
+- Generate a `trace_id` per request and propagate it through context
+- Track `duration_ms` per request
+- **Level policy:** 2xx → INFO, 4xx → WARN, 5xx/exceptions → ERROR
+- Required fields: `ts`, `level`, `msg`, `service`
+- Customize `docs/OBSERVABILITY.md` with project-specific endpoint examples
+
+## Execution Plans
+
+- For tasks expected to exceed ~30 minutes, create/update `PLANS.md` before coding.
+- Track scope, constraints, milestones, and verification steps.
+
+## Static Analysis And Quality Gates
+
+- Run `make check` before `make test`.
+- Run `make ci` before pushing large refactors.
+- Treat lint/type failures as blocking.
+- **Minimum test coverage:** at least 5 meaningful tests covering core operations (CRUD, error cases, edge cases). Tests must verify real behavior, not just "assert True".
+
+## Entropy Management
+
+- Remove stale scripts/docs quickly.
+- Keep templates and real workflows in sync.
+- **Before considering setup complete, run:** `scripts/verify_customized.sh .` — it catches leftover template placeholders, stub scripts, and missing observability wiring.
+- Run periodic harness audits: `scripts/audit_harness.sh .`
diff --git a/.github/skills/agent-native/assets/templates/Makefile.harness b/.github/skills/agent-native/assets/templates/Makefile.harness
new file mode 100644
index 0000000..6b28fd6
--- /dev/null
+++ b/.github/skills/agent-native/assets/templates/Makefile.harness
@@ -0,0 +1,41 @@
+.PHONY: setup format smoke test lint typecheck check ci verify observe
+
+setup:
+	@./scripts/harness/setup.sh
+
+format:
+	@./scripts/harness/format.sh
+
+smoke:
+	@./scripts/harness/smoke.sh
+
+test:
+	@./scripts/harness/test.sh
+
+lint:
+	@./scripts/harness/lint.sh
+
+typecheck:
+	@./scripts/harness/typecheck.sh
+
+check: lint typecheck
+
+ci: smoke check test
+
+# Agent verify loop: run checks + inspect runtime telemetry
+verify: ci
+	@echo "--- Runtime Errors ---"
+	@jq -c 'select(.level == "ERROR")' .harness/logs.jsonl 2>/dev/null | tail -5 || true
+	@echo "--- Slow Requests (>1s) ---"
+	@jq -c 'select(.duration_ms > 1000)' .harness/traces.jsonl 2>/dev/null | tail -5 || true
+
+# Quick observability check (no test run)
+observe:
+	@echo "=== Errors ==="
+	@jq -s 'map(select(.level == "ERROR")) | length' .harness/logs.jsonl 2>/dev/null || echo "0"
+	@echo "=== Slow (>500ms) ==="
+	@jq -s 'map(select(.duration_ms > 500)) | length' .harness/traces.jsonl 2>/dev/null || echo "0"
+	@echo "=== Log Lines ==="
+	@if [ -f .harness/logs.jsonl ]; then wc -l < .harness/logs.jsonl | awk '{print $$1}'; else echo "0"; fi
+	@echo "=== Trace Lines ==="
+	@if [ -f .harness/traces.jsonl ]; then wc -l < .harness/traces.jsonl | awk '{print $$1}'; else echo "0"; fi
diff --git a/.github/skills/agent-native/assets/templates/PLANS.md b/.github/skills/agent-native/assets/templates/PLANS.md
new file mode 100644
index 0000000..ce8afae
--- /dev/null
+++ b/.github/skills/agent-native/assets/templates/PLANS.md
@@ -0,0 +1,54 @@
+# PLANS.md
+
+Use this file for multi-step work where durable context matters.
+
+## Objective
+
+- Outcome:
+- Why it matters:
+- Non-goals:
+
+## Constraints
+
+- Runtime/tooling constraints:
+- Security/compliance constraints:
+- Performance/reliability constraints:
+
+## Context Snapshot
+
+- Relevant files/modules:
+- Existing commands/workflows:
+- Known risks:
+
+## Execution Plan
+
+1. Step:
+   - Expected output:
+   - Verification:
+2. Step:
+   - Expected output:
+   - Verification:
+3. Step:
+   - Expected output:
+   - Verification:
+
+## Checkpoints
+
+- [ ] Baseline captured
+- [ ] Implementation complete
+- [ ] Static checks passed
+- [ ] Tests passed
+- [ ] Docs updated
+
+## Decision Log
+
+- Date:
+  - Decision:
+  - Reason:
+  - Alternatives considered:
+
+## Final Verification
+
+- Commands run:
+- Key outputs:
+- Follow-up tasks:
diff --git a/.github/skills/agent-native/assets/templates/docs/ARCHITECTURE.md b/.github/skills/agent-native/assets/templates/docs/ARCHITECTURE.md
new file mode 100644
index 0000000..148340b
--- /dev/null
+++ b/.github/skills/agent-native/assets/templates/docs/ARCHITECTURE.md
@@ -0,0 +1,159 @@
+# Architecture
+
+<!-- ⚠️  CUSTOMIZE THIS FILE — replace every section below with your project's real details.
+     See the example at the bottom for what a filled-in version looks like.
+     A grader/auditor will FAIL this file if it still contains placeholder text. -->
+
+## Purpose
+
+_Replace: What does this service/app do? One sentence._
+
+## Boundaries
+
+| Boundary | Input | Output | Owner |
+|---|---|---|---|
+| _Replace_ | _HTTP request / CLI args / event_ | _DTO / response / side effect_ | _module/package_ |
+
+## Data Shape Contracts
+
+- Parse and validate external data at boundaries.
+- Convert to internal typed models before crossing module boundaries.
+- Keep boundary transformation logic centralized and testable.
+
+## Module Ownership Rules
+
+- One primary responsibility per module.
+- No cross-layer shortcuts without explicit architecture update.
+- New modules require ownership and boundary documentation.
+
+## Execution Flow
+
+_Replace each step with your actual request lifecycle:_
+
+1. Entry: _e.g., HTTP request hits router_
+2. Boundary parse/validate: _e.g., middleware validates auth + body schema_
+3. Core execution: _e.g., service layer applies business logic_
+4. Persistence/output: _e.g., write to DB / return response_
+5. Event/log emission: _e.g., hlog() writes to .harness/logs.jsonl_
+
+## Enforcing Boundaries With Lint Rules
+
+Architecture docs rot. Lint rules don't. Encode your boundaries as static analysis rules so agents (and humans) get instant feedback when they violate them.
+
+### Import/Dependency Restrictions
+
+Prevent modules from importing across boundaries they shouldn't cross.
+
+**TypeScript (ESLint `no-restricted-imports` / `import/no-restricted-paths`):**
+```jsonc
+// ESLint config fragment (rules object only): block store from importing route-layer code; scope it to store/ files in your config
+{
+  "rules": {
+    "no-restricted-imports": ["error", {
+      "patterns": [{
+        "group": ["../routes/*", "../middleware/*"],
+        "message": "Store layer must not import from routes or middleware."
+      }]
+    }]
+  }
+}
+```
+
+**Python (Ruff / `import-linter`):**
+```toml
+# pyproject.toml — using import-linter
+[tool.importlinter]
+root_packages = ["app"]
+
+[[tool.importlinter.contracts]]
+name = "Domain must not import from API layer"
+type = "forbidden"
+source_modules = ["app.domain"]
+forbidden_modules = ["app.routes", "app.middleware"]
+```
+
+**Go (depguard via golangci-lint):**
+```yaml
+# .golangci.yml
+linters:
+  enable:
+    - depguard
+linters-settings:
+  depguard:
+    rules:
+      store-boundary:
+        deny:
+          - pkg: "harness-test-go/handlers"
+            desc: "Store package must not import handlers"
+        files:
+          - "**/store/**"
+```
+
+**C# (Roslyn analyzers / NDepend / ArchUnitNET):**
+```csharp
+// Using ArchUnitNET in a test:
+[Fact]
+public void Domain_Should_Not_Reference_Controllers()
+{
+    Types().That().ResideInNamespace("App.Domain")
+        .Should().NotDependOnAny(
+            Types().That().ResideInNamespace("App.Controllers"))
+        .Check(Architecture);
+}
+```
+
+### Custom Rules for Your Project
+
+_Replace: Add project-specific lint rules here. Common patterns:_
+
+- **No direct DB access outside the store layer** — restrict ORM/SQL imports to `store/` or `repository/`
+- **No HTTP calls in domain logic** — restrict `fetch`/`requests`/`HttpClient` to service layer
+- **Config must flow through injection** — ban `process.env` / `os.environ` reads outside config module
+- **No cross-feature imports** — in monorepos, features can't import from sibling features directly
+
+### Wiring Into the Harness
+
+Add boundary lint rules to `scripts/harness/lint.sh` so they run in `make check` and `make ci`:
+
+```bash
+# In scripts/harness/lint.sh — add after standard linting
+echo "Checking architectural boundaries..."
+# TypeScript: eslint already covers it if rules are in config
+# Python: import-linter
+lint-imports --config pyproject.toml
+# Go: golangci-lint already covers it if depguard is enabled
+# C#: dotnet test --filter "Category=Architecture"
+```
+
+## Refactor Checklist
+
+- [ ] Boundary contracts unchanged or versioned.
+- [ ] Ownership map still accurate.
+- [ ] Integration tests cover boundary paths.
+- [ ] Documentation updated in same change.
+
+---
+
+## Example (delete this section after customizing)
+
+Below is what a filled-in architecture doc looks like for a task tracker API:
+
+```markdown
+## Purpose
+REST API for managing tasks (CRUD) with in-memory storage and JSONL observability.
+
+## Boundaries
+| Boundary | Input | Output | Owner |
+|---|---|---|---|
+| HTTP Router | HTTP request | route match + params | routes.py / router.ts |
+| Validation | raw JSON body | typed Task model | models.py / types.ts |
+| Store | Task model | persisted Task (dict/Map) | store.py / store.ts |
+| Observability | request context | JSONL log/trace lines | middleware (hlog) |
+
+## Execution Flow
+1. Entry: HTTP request → framework router
+2. Boundary: request body parsed into typed model, 422 on invalid
+3. Core: store CRUD operation (create/read/update/delete)
+4. Response: JSON serialization of result, appropriate status code
+5. Observability: middleware logs request completion + duration to .harness/
+```
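Review note on docs/ARCHITECTURE.md: the "Custom Rules for Your Project" patterns can also be prototyped without any lint plugin. The following is a minimal sketch, assuming Python sources and the `app.routes` / `app.middleware` module names from the import-linter example; `forbidden_imports` is a hypothetical helper, not part of the harness.

```python
import ast


def forbidden_imports(source: str, forbidden: tuple[str, ...]) -> list[str]:
    """Return the imported module names in `source` that fall under a forbidden prefix."""
    hits = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # Cover both `from app.routes import x` and `from app import routes`.
            names = [node.module] + [f"{node.module}.{a.name}" for a in node.names]
        else:
            continue  # not an import, or a relative import (ignored in this sketch)
        for name in names:
            if any(name == p or name.startswith(p + ".") for p in forbidden):
                hits.add(name)
    return sorted(hits)
```

In a real check, this function would be applied to every file under the domain package, and the script would exit non-zero when any hit is reported.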
diff --git a/.github/skills/agent-native/assets/templates/docs/OBSERVABILITY.md b/.github/skills/agent-native/assets/templates/docs/OBSERVABILITY.md
new file mode 100644
index 0000000..3bd300f
--- /dev/null
+++ b/.github/skills/agent-native/assets/templates/docs/OBSERVABILITY.md
@@ -0,0 +1,325 @@
+# Observability
+
+## Goal
+
+Give coding agents runtime visibility — errors, slow requests, trace correlation — without requiring a full observability stack.
+
+## Default: JSONL + jq (Zero Dependencies)
+
+All telemetry writes to `.harness/` as append-only JSONL files:
+
+```
+.harness/
+  logs.jsonl        # Structured log lines
+  traces.jsonl      # Span records (start, end, trace_id, parent_id)
+  metrics.jsonl     # Metric data points
+```
+
+### Log Format
+
+One JSON object per line. Required fields:
+
+```json
+{"ts":"2026-02-28T14:05:32.123Z","level":"ERROR","msg":"Connection timeout","service":"api","trace_id":"abc123","duration_ms":5012}
+```
+
+| Field | Required | Description |
+|---|---|---|
+| `ts` | ✅ | ISO 8601 timestamp |
+| `level` | ✅ | TRACE, DEBUG, INFO, WARN, ERROR, FATAL (see level policy below) |
+| `msg` | ✅ | Human-readable message |
+| `service` | ✅ | Service/component name |
+| `trace_id` | | Correlation ID across events |
+| `span_id` | | Span identifier |
+| `parent_id` | | Parent span for trace trees |
+| `duration_ms` | | Request/operation duration |
+| `status` | | ok, error |
+| `error` | | Error message/stack (when `status` is `error`, i.e. WARN or ERROR entries) |
+| `*` | | Any additional fields |
+
+### Level Policy
+
+Use the right level for HTTP responses and application events:
+
+| Condition | Level | Example |
+|---|---|---|
+| Success (2xx) | `INFO` | `{"level":"INFO","msg":"GET /tasks 200","status":"ok"}` |
+| Client error (4xx) | `WARN` | `{"level":"WARN","msg":"GET /tasks/999 404","status":"error","error":"not found"}` |
+| Server error (5xx) | `ERROR` | `{"level":"ERROR","msg":"POST /tasks 500","status":"error","error":"db connection refused"}` |
+| Unhandled exception | `ERROR` | `{"level":"ERROR","msg":"unhandled exception","error":"TypeError: ..."}` |
+| Slow request (>threshold) | `WARN` | `{"level":"WARN","msg":"slow request","duration_ms":3200}` |
+
+**Why this matters:** Agents use `jq 'select(.level == "ERROR")'` to find problems. If 404s are logged as INFO, queries for WARN or ERROR never surface them and failed requests go unnoticed. If 404s are logged as ERROR, agents drown in noise from expected "not found" responses. WARN is the compromise: visible in targeted queries (`.level == "WARN" or .level == "ERROR"`) without polluting the error channel.
+
+> **Rule of thumb:** If the *caller* made a mistake → WARN. If *your code* broke → ERROR.
+
+### Trace Format
+
+```json
+{"trace_id":"abc123","span_id":"span1","parent_id":null,"name":"GET /api/users","service":"api","start":"2026-02-28T14:05:32.000Z","end":"2026-02-28T14:05:32.245Z","duration_ms":245,"status":"ok"}
+```
+
+### Metric Format (optional)
+
+Metrics are **optional** for dev-time use. Traces already capture `duration_ms` per request, which is sufficient for performance analysis. Use metrics only if you need counters or gauges (e.g., queue depth, cache hit rate) that don't map to individual requests.
+
+```json
+{"ts":"2026-02-28T14:05:32.000Z","name":"http.duration","service":"api","value":245,"unit":"ms"}
+```
+
+## Querying (jq)
+
+Common queries for agents:
+
+```bash
+# Recent errors
+jq 'select(.level == "ERROR")' .harness/logs.jsonl
+
+# Errors AND warnings (catches 4xx + 5xx)
+jq 'select(.level == "ERROR" or .level == "WARN")' .harness/logs.jsonl
+
+# Errors in the last 5 minutes (agent computes the cutoff; use the same UTC millisecond format as `ts` so string comparison is correct)
+jq --arg since "2026-02-28T14:00:00.000Z" 'select(.level == "ERROR" and .ts >= $since)' .harness/logs.jsonl
+
+# Slow requests (>500ms)
+jq 'select(.duration_ms > 500)' .harness/traces.jsonl
+
+# All events for a specific trace
+jq --arg tid "abc123" 'select(.trace_id == $tid)' .harness/logs.jsonl .harness/traces.jsonl
+
+# Error count by service
+jq -s 'map(select(.level == "ERROR")) | group_by(.service) | map({service: .[0].service, count: length})' .harness/logs.jsonl
+
+# Unique error messages
+jq -s 'map(select(.level == "ERROR")) | map(.msg) | unique' .harness/logs.jsonl
+```
+
+### Pre-Built Query Library
+
+Copy `.harness/queries/` into your repo for common agent queries:
+
+```bash
+# .harness/queries/errors.jq
+select(.level == "ERROR")
+
+# .harness/queries/slow.jq — pass --argjson threshold 500
+select(.duration_ms > $threshold)
+
+# .harness/queries/trace.jq — pass --arg tid <trace_id>
+select(.trace_id == $tid)
+```
+
+Usage: `jq -f .harness/queries/errors.jq .harness/logs.jsonl`
+
+## Logging in Your App
+
+### Node.js (zero deps)
+
+```typescript
+import { appendFileSync, mkdirSync } from 'fs';
+import { randomUUID } from 'crypto';
+
+mkdirSync('.harness', { recursive: true });
+
+function hlog(entry: Record<string, unknown>) {
+  const line = JSON.stringify({ ts: new Date().toISOString(), ...entry });
+  appendFileSync('.harness/logs.jsonl', line + '\n');
+}
+
+function htrace(entry: Record<string, unknown>) {
+  const line = JSON.stringify(entry);
+  appendFileSync('.harness/traces.jsonl', line + '\n');
+}
+
+// --- Logging (writes to .harness/logs.jsonl) ---
+hlog({ level: 'INFO', msg: 'GET /tasks 200', service: 'api', duration_ms: 12 });
+hlog({ level: 'WARN', msg: 'GET /tasks/999 404', service: 'api', status: 'error', error: 'not found' });
+hlog({ level: 'ERROR', msg: 'POST /tasks 500', service: 'api', status: 'error', error: err.message });
+
+// --- Tracing (writes to .harness/traces.jsonl) ---
+// Call htrace() in middleware after each request completes:
+const traceId = randomUUID();
+const start = new Date();
+// ... handle request ...
+const end = new Date();
+htrace({
+  trace_id: traceId, span_id: randomUUID(), parent_id: null,
+  name: `${req.method} ${req.path}`, service: 'api',
+  start: start.toISOString(), end: end.toISOString(),
+  duration_ms: end.getTime() - start.getTime(),
+  status: res.statusCode < 400 ? 'ok' : 'error'
+});
+```
+
+### Python (zero deps)
+
+```python
+import json, datetime, pathlib, uuid
+
+pathlib.Path(".harness").mkdir(exist_ok=True)
+
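+def _append(path, entry):
+    # Open, write one line, and close the handle on every call.
+    with open(path, "a", encoding="utf-8") as f:
+        f.write(json.dumps(entry) + "\n")
+
+def hlog(**kwargs):
+    # utcnow() is deprecated since Python 3.12; use an aware UTC timestamp instead.
+    ts = datetime.datetime.now(datetime.timezone.utc).isoformat().replace("+00:00", "Z")
+    _append(".harness/logs.jsonl", {"ts": ts, **kwargs})
+
+def htrace(**kwargs):
+    _append(".harness/traces.jsonl", kwargs)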
+
+# --- Logging ---
+hlog(level="INFO", msg="GET /tasks 200", service="api", duration_ms=8)
+hlog(level="WARN", msg="GET /tasks/999 404", service="api", status="error", error="not found")
+hlog(level="ERROR", msg="POST /tasks 500", service="api", status="error", error="db timeout")
+
+# --- Tracing (call in middleware after request completes) ---
+htrace(
+    trace_id=str(uuid.uuid4()), span_id=str(uuid.uuid4()), parent_id=None,
+    name="GET /tasks", service="api",
+    start=start_time.isoformat() + "Z", end=end_time.isoformat() + "Z",
+    duration_ms=round((end_time - start_time).total_seconds() * 1000),
+    status="ok"  # or "error" for 4xx/5xx
+)
+```
+
+### Go
+
+```go
+package harness
+
+import (
+    "encoding/json"
+    "os"
+    "time"
+)
+
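+// appendLine creates .harness/ if needed and appends one JSON line to path.
+func appendLine(path string, fields map[string]any) error {
+    line, err := json.Marshal(fields)
+    if err != nil {
+        return err
+    }
+    if err := os.MkdirAll(".harness", 0o755); err != nil {
+        return err
+    }
+    f, err := os.OpenFile(path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
+    if err != nil {
+        return err // f is nil here, so it must not be closed
+    }
+    defer f.Close()
+    _, err = f.Write(append(line, '\n'))
+    return err
+}
+
+func HLog(fields map[string]any) error {
+    fields["ts"] = time.Now().UTC().Format(time.RFC3339Nano)
+    return appendLine(".harness/logs.jsonl", fields)
+}
+
+func HTrace(fields map[string]any) error {
+    return appendLine(".harness/traces.jsonl", fields)
+}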
+
+// --- Logging ---
+// harness.HLog(map[string]any{"level": "WARN", "msg": "GET /notes/999 404", "service": "api", "error": "not found"})
+
+// --- Tracing (call in middleware after request completes) ---
+// harness.HTrace(map[string]any{
+//     "trace_id": traceID, "span_id": spanID, "parent_id": nil,
+//     "name": "GET /notes", "service": "api",
+//     "start": start.Format(time.RFC3339Nano), "end": end.Format(time.RFC3339Nano),
+//     "duration_ms": end.Sub(start).Milliseconds(), "status": "ok",
+// })
+```
+
+### C#
+
+```csharp
+using System.Text.Json;
+
+public static class Harness
+{
+    private static readonly string LogPath = ".harness/logs.jsonl";
+    private static readonly string TracePath = ".harness/traces.jsonl";
+
+    static Harness() => Directory.CreateDirectory(".harness");
+
+    // --- Logging (writes to .harness/logs.jsonl) ---
+    public static void Log(string level, string msg, string service, Dictionary<string, object>? extra = null)
+    {
+        var entry = new Dictionary<string, object>
+        {
+            ["ts"] = DateTime.UtcNow.ToString("O"),
+            ["level"] = level,
+            ["msg"] = msg,
+            ["service"] = service
+        };
+        if (extra != null) foreach (var kv in extra) entry[kv.Key] = kv.Value;
+        File.AppendAllText(LogPath, JsonSerializer.Serialize(entry) + "\n");
+    }
+
+    // --- Tracing (writes to .harness/traces.jsonl) ---
+    public static void Trace(string traceId, string name, string service,
+        DateTime start, DateTime end, string status, Dictionary<string, object>? extra = null)
+    {
+        var entry = new Dictionary<string, object>
+        {
+            ["trace_id"] = traceId,
+            ["span_id"] = Guid.NewGuid().ToString(),
+            ["parent_id"] = null!,
+            ["name"] = name,
+            ["service"] = service,
+            ["start"] = start.ToString("O"),
+            ["end"] = end.ToString("O"),
+            ["duration_ms"] = (end - start).TotalMilliseconds,
+            ["status"] = status
+        };
+        if (extra != null) foreach (var kv in extra) entry[kv.Key] = kv.Value;
+        File.AppendAllText(TracePath, JsonSerializer.Serialize(entry) + "\n");
+    }
+}
+
+// --- In middleware, after the request completes: ---
+// var traceId = Guid.NewGuid().ToString();
+// var start = DateTime.UtcNow;
+// await next(context);
+// var end = DateTime.UtcNow;
+// var status = context.Response.StatusCode < 400 ? "ok" : "error";
+// Harness.Log(level, $"{method} {path} {statusCode}", "api", new() { ["trace_id"] = traceId, ["duration_ms"] = (end - start).TotalMilliseconds });
+// Harness.Trace(traceId, $"{method} {path}", "api", start, end, status);
+```
+
+### Any Language
+
+Append JSON lines to `.harness/logs.jsonl` (logs) and `.harness/traces.jsonl` (traces). That's it.
+
+## Verify Script Integration
+
+`make verify` should include an observability check:
+
+```bash
+# In scripts/harness/smoke.sh or verify script:
+ERRORS=$(jq -s 'map(select(.level == "ERROR")) | length' .harness/logs.jsonl 2>/dev/null || echo 0)
+if [ "$ERRORS" -gt 0 ]; then
+  echo "⚠️  $ERRORS errors found in .harness/logs.jsonl"
+  jq -c 'select(.level == "ERROR")' .harness/logs.jsonl | tail -5
+fi
+```
+
+## Graduation: devtel (When You Outgrow JSONL)
+
+When you need:
+- Cross-signal joins (logs ↔ traces by trace_id)
+- Aggregations (p95 latency, error rates)
+- 100K+ events per session
+- Auto-instrumentation (HTTP, DB, etc. without manual logging)
+
+Upgrade to [devtel](https://github.com/bertclaws/devtel):
+
+```bash
+npm install devtel
+npx devtel init
+
+# Replace manual hlog() calls with:
+# import "devtel/init";
+# (auto-instruments everything via OpenTelemetry)
+
+# Query with SQL instead of jq:
+npx devtel logs --level error --last 5m
+npx devtel traces --slow 500ms
+npx devtel query "SELECT * FROM spans JOIN logs USING (trace_id)"
+```
+
+## Rules
+
+- Keep field names stable — agents and scripts depend on them.
+- Emit structured JSON, never unstructured text logs.
+- Include trace_id on anything that crosses a boundary.
+- `.harness/` is gitignored. Ephemeral by default.
+- Redact secrets and PII.
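Review note on docs/OBSERVABILITY.md: the level policy and the "errors in the last 5 minutes" query can be sketched in a few lines of Python. This is an illustration under stated assumptions, not part of the template: `level_for`, `parse_ts`, and `recent_errors` are hypothetical helper names, the 1000 ms slow-request threshold is a placeholder for the unspecified threshold, and timestamps are assumed to use the millisecond UTC format shown in the log example.

```python
import json
from datetime import datetime, timedelta, timezone


def level_for(status_code: int, duration_ms: float, slow_ms: float = 1000) -> str:
    """Map an HTTP outcome to a log level following the level policy table."""
    if status_code >= 500:
        return "ERROR"  # our code broke
    if status_code >= 400 or duration_ms > slow_ms:
        return "WARN"  # caller mistake, or a slow request
    return "INFO"


def parse_ts(ts: str) -> datetime:
    # fromisoformat() accepts a trailing "Z" only from Python 3.11, so normalise it.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def recent_errors(lines, now: datetime, minutes: int = 5) -> list[dict]:
    """Return ERROR entries whose `ts` falls within the last `minutes` before `now`."""
    cutoff = now - timedelta(minutes=minutes)
    found = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the JSONL file
        entry = json.loads(line)
        if entry.get("level") == "ERROR" and parse_ts(entry["ts"]) >= cutoff:
            found.append(entry)
    return found
```

Passing an open file handle for `.harness/logs.jsonl` as `lines` and `datetime.now(timezone.utc)` as `now` gives the same result as the jq query, without the agent having to format the cutoff string itself.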
diff --git a/.github/skills/agent-native/assets/templates/docs/control/ACTUATORS.md b/.github/skills/agent-native/assets/templates/docs/control/ACTUATORS.md
new file mode 100644
index 0000000..fa57117
--- /dev/null
+++ b/.github/skills/agent-native/assets/templates/docs/control/ACTUATORS.md
@@ -0,0 +1,25 @@
+# Actuators
+
+Define the actions agents are allowed to perform to move the system toward setpoints.
+
+## Actuation Surface
+
+- Code edits
+- Test and build execution
+- Script/template updates
+- CI workflow adjustments
+- Documentation updates
+
+## Safety Boundaries
+
+- Protected branches/rules:
+- Restricted commands:
+- Approval-required actions:
+
+## Action Catalog
+
+| Action | Preconditions | Postconditions | Rollback |
+|---|---|---|---|
+| patch code | tests defined | checks green | revert commit |
+| update harness docs | doc owner review | docs aligned | restore prior doc |
+| tune CI workflow | CI dry run | stable runtime | revert workflow |
diff --git a/.github/skills/agent-native/assets/templates/docs/control/CONTROLLER.md b/.github/skills/agent-native/assets/templates/docs/control/CONTROLLER.md
new file mode 100644
index 0000000..6955f81
--- /dev/null
+++ b/.github/skills/agent-native/assets/templates/docs/control/CONTROLLER.md
@@ -0,0 +1,28 @@
+# Controller
+
+Describe the policy and logic that decides corrective actions.
+
+## Control Policy
+
+- Primary control objective:
+- Secondary objectives:
+- Priority order:
+
+## Control Inputs
+
+- Required signals:
+- Input freshness constraints:
+- Input confidence thresholds:
+
+## Control Actions
+
+- Tighten constraints (docs/scripts/gates)
+- Adjust evaluation scope
+- Escalate to human review
+- Trigger refactor cleanup
+
+## Escalation Rules
+
+- Escalation trigger:
+- Escalation owner:
+- Maximum autonomous retries:
diff --git a/.github/skills/agent-native/assets/templates/docs/control/CONTROL_SYSTEM.md b/.github/skills/agent-native/assets/templates/docs/control/CONTROL_SYSTEM.md
new file mode 100644
index 0000000..211d197
--- /dev/null
+++ b/.github/skills/agent-native/assets/templates/docs/control/CONTROL_SYSTEM.md
@@ -0,0 +1,27 @@
+# Control System Model
+
+## Purpose
+
+Use this document to keep the repository's autonomous development loop explicit and stable.
+
+## System Definition
+
+- Setpoint:
+- Plant:
+- Controller:
+- Actuators:
+- Sensors:
+- Feedback channels:
+- Disturbances:
+
+## Maturity Targets
+
+- Stability target:
+- Adaptation target:
+- Recovery target:
+
+## Review Cadence
+
+- Weekly harness review owner:
+- Monthly architecture review owner:
+- Entropy cleanup cadence:
diff --git a/.github/skills/agent-native/assets/templates/docs/control/ENTROPY.md b/.github/skills/agent-native/assets/templates/docs/control/ENTROPY.md
new file mode 100644
index 0000000..0abd52d
--- /dev/null
+++ b/.github/skills/agent-native/assets/templates/docs/control/ENTROPY.md
@@ -0,0 +1,28 @@
+# Entropy Management
+
+Define recurring cleanup actions that prevent harness drift.
+
+## Drift Sources
+
+- Stale docs after workflow changes
+- Dead scripts no longer called by CI
+- Flaky tests ignored over time
+- Inconsistent logging field names
+
+## Entropy Controls
+
+- Weekly harness audit
+- Monthly docs/script alignment review
+- Periodic flaky-test triage
+- Architectural boundary checks after refactors
+
+## Required Commands
+
+- `scripts/harness/entropy_check.sh`
+- `scripts/audit_harness.sh .`
+
+## Ownership
+
+- Primary owner:
+- Backup owner:
+- Review cadence:
diff --git a/.github/skills/agent-native/assets/templates/docs/control/FEEDBACK_LOOP.md b/.github/skills/agent-native/assets/templates/docs/control/FEEDBACK_LOOP.md
new file mode 100644
index 0000000..34ea7b3
--- /dev/null
+++ b/.github/skills/agent-native/assets/templates/docs/control/FEEDBACK_LOOP.md
@@ -0,0 +1,23 @@
+# Feedback Loop
+
+Define how observations produce corrective actions.
+
+## Loop Steps
+
+1. Measure: capture sensor outputs.
+2. Compare: compute error against setpoints.
+3. Decide: choose control action.
+4. Act: apply change.
+5. Verify: re-measure and close the loop.
+
+## Control Frequency
+
+- Fast loop (per change):
+- Daily loop:
+- Weekly loop:
+
+## Error Budget Policy
+
+- Error budget metric:
+- Budget window:
+- Budget exhaustion response:
diff --git a/.github/skills/agent-native/assets/templates/docs/control/SENSORS.md b/.github/skills/agent-native/assets/templates/docs/control/SENSORS.md
new file mode 100644
index 0000000..29d9466
--- /dev/null
+++ b/.github/skills/agent-native/assets/templates/docs/control/SENSORS.md
@@ -0,0 +1,25 @@
+# Sensors
+
+List the signals used to evaluate whether the system is on target.
+
+## Required Sensors
+
+- CI results (lint/typecheck/test/smoke)
+- Structured runtime events
+- Trace spans for long workflows
+- Regression eval outcomes
+- Review outcomes (requested changes, approval lag)
+
+## Signal Contracts
+
+| Sensor | Required Fields | Sampling | Storage |
+|---|---|---|---|
+| harness events | trace_id, run_id, status, duration_ms | always | logs/traces |
+| CI checks | check_name, status, duration_ms | always | CI provider |
+| eval runs | task_id, pass_fail, score, runtime | per run | eval store |
+
+## Sensor Gaps
+
+- Missing signals:
+- Noisy/unreliable signals:
+- Planned remediation:
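Review note on docs/control/SENSORS.md: the Signal Contracts table lends itself to a mechanical check. A minimal sketch, assuming events arrive as Python dicts; `CONTRACTS` mirrors the table rows and `missing_fields` is a hypothetical helper name.

```python
# Required fields per sensor, copied from the Signal Contracts table.
CONTRACTS = {
    "harness events": ("trace_id", "run_id", "status", "duration_ms"),
    "CI checks": ("check_name", "status", "duration_ms"),
    "eval runs": ("task_id", "pass_fail", "score", "runtime"),
}


def missing_fields(sensor: str, event: dict) -> list[str]:
    """Return the contract fields that are absent or null in `event`."""
    return [f for f in CONTRACTS[sensor] if event.get(f) is None]
```

A non-empty result is one candidate entry for the "Missing signals" line under Sensor Gaps.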
diff --git a/.github/skills/agent-native/assets/templates/docs/control/SETPOINTS.md b/.github/skills/agent-native/assets/templates/docs/control/SETPOINTS.md
new file mode 100644
index 0000000..ff856e2
--- /dev/null
+++ b/.github/skills/agent-native/assets/templates/docs/control/SETPOINTS.md
@@ -0,0 +1,19 @@
+# Setpoints
+
+Define numeric targets for the autonomous development loop.
+
+## Core Setpoints
+
+| Metric | Target | Alert Threshold | Owner |
+|---|---|---|---|
+| PR pass@1 | | | |
+| Time to actionable failure | | | |
+| Merge cycle time | | | |
+| Revert rate | | | |
+| Human intervention rate | | | |
+
+## Constraints
+
+- Required quality gates:
+- Security constraints:
+- Cost/runtime constraints:
diff --git a/.github/skills/agent-native/assets/templates/docs/control/STABILITY.md b/.github/skills/agent-native/assets/templates/docs/control/STABILITY.md
new file mode 100644
index 0000000..25358ab
--- /dev/null
+++ b/.github/skills/agent-native/assets/templates/docs/control/STABILITY.md
@@ -0,0 +1,25 @@
+# Stability
+
+Track whether the development loop remains stable under normal and disturbed conditions.
+
+## Stability Indicators
+
+- Check pass consistency over time
+- Low variance in cycle time
+- Bounded retry counts
+- Controlled regression rate
+
+## Disturbance Scenarios
+
+| Scenario | Expected Behavior | Recovery Target |
+|---|---|---|
+| dependency upgrade | temporary check failures | recover within 1 day |
+| major feature branch | higher variance | recover within sprint |
+| infra outage | degraded CI signal | recover when infra restored |
+
+## Stabilization Playbook
+
+- Reconfirm setpoints.
+- Reduce surface area of active change.
+- Enforce stricter checks temporarily.
+- Run entropy cleanup.
diff --git a/.github/skills/agent-native/assets/templates/evals/control-loop-metrics.yaml b/.github/skills/agent-native/assets/templates/evals/control-loop-metrics.yaml
new file mode 100644
index 0000000..83e6ed0
--- /dev/null
+++ b/.github/skills/agent-native/assets/templates/evals/control-loop-metrics.yaml
@@ -0,0 +1,29 @@
+version: 1
+setpoints:
+  pr_pass_at_1:
+    target: 0.70
+    alert_below: 0.55
+  merge_cycle_time_hours:
+    target: 24
+    alert_above: 48
+  revert_rate:
+    target: 0.03
+    alert_above: 0.08
+  human_intervention_rate:
+    target: 0.20
+    alert_above: 0.40
+  time_to_actionable_failure_minutes:
+    target: 10
+    alert_above: 30
+
+measurements:
+  - name: pr_pass_at_1
+    source: ci
+  - name: merge_cycle_time_hours
+    source: scm
+  - name: revert_rate
+    source: scm
+  - name: human_intervention_rate
+    source: review
+  - name: time_to_actionable_failure_minutes
+    source: ci
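Review note on evals/control-loop-metrics.yaml: a sketch of how the setpoints might be evaluated. The reading of `alert_below` and `alert_above` as strict thresholds is an assumption, the values are inlined as a dict to keep the example free of a YAML dependency, and `breached` is a hypothetical helper name.

```python
# Setpoints copied from control-loop-metrics.yaml.
SETPOINTS = {
    "pr_pass_at_1": {"target": 0.70, "alert_below": 0.55},
    "merge_cycle_time_hours": {"target": 24, "alert_above": 48},
    "revert_rate": {"target": 0.03, "alert_above": 0.08},
    "human_intervention_rate": {"target": 0.20, "alert_above": 0.40},
    "time_to_actionable_failure_minutes": {"target": 10, "alert_above": 30},
}


def breached(setpoints: dict, measured: dict) -> list[str]:
    """Return the names of measured metrics that cross their alert threshold."""
    alerts = []
    for name, value in measured.items():
        spec = setpoints.get(name)
        if spec is None:
            continue  # no setpoint defined for this measurement
        if "alert_below" in spec and value < spec["alert_below"]:
            alerts.append(name)
        elif "alert_above" in spec and value > spec["alert_above"]:
            alerts.append(name)
    return sorted(alerts)
```

A value sitting exactly on a threshold does not alert under this reading; switch to `<=` and `>=` if the thresholds are meant to be inclusive.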
diff --git a/.github/skills/agent-native/assets/templates/scripts/audit_harness.sh b/.github/skills/agent-native/assets/templates/scripts/audit_harness.sh
new file mode 100755
index 0000000..0f0a581
--- /dev/null
+++ b/.github/skills/agent-native/assets/templates/scripts/audit_harness.sh
@@ -0,0 +1,95 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+usage() {
+  cat <<'EOF'
+Usage: scripts/audit_harness.sh [repo_path]
+
+Audit a repository for baseline harness engineering artifacts.
+EOF
+}
+
+target_path="${1:-.}"
+if [ "$target_path" = "-h" ] || [ "$target_path" = "--help" ]; then
+  usage
+  exit 0
+fi
+
+if [ ! -d "$target_path" ]; then
+  echo "error: target path does not exist: $target_path" >&2
+  exit 1
+fi
+
+target_path=$(cd "$target_path" && pwd)
+failures=0
+
+ok() {
+  echo "[ok]      $1"
+}
+
+fail() {
+  echo "[missing] $1"
+  failures=$((failures + 1))
+}
+
+check_file() {
+  local relative="$1"
+  if [ -f "$target_path/$relative" ]; then
+    ok "$relative"
+  else
+    fail "$relative"
+  fi
+}
+
+check_contains() {
+  local relative="$1"
+  local pattern="$2"
+  local label="$3"
+  local full="$target_path/$relative"
+
+  if [ ! -f "$full" ]; then
+    fail "$label (file missing: $relative)"
+    return
+  fi
+
+  if grep -Eq "$pattern" "$full"; then
+    ok "$label"
+  else
+    fail "$label"
+  fi
+}
+
+echo "Auditing harness artifacts in: $target_path"
+echo
+
+check_file "AGENTS.md"
+check_file "PLANS.md"
+check_file "docs/ARCHITECTURE.md"
+check_file "docs/OBSERVABILITY.md"
+check_file "Makefile.harness"
+check_file "scripts/audit_harness.sh"
+check_file "scripts/harness/smoke.sh"
+check_file "scripts/harness/test.sh"
+check_file "scripts/harness/lint.sh"
+check_file "scripts/harness/typecheck.sh"
+check_file ".github/workflows/harness.yml"
+
+echo
+check_contains "AGENTS.md" "Harness Commands" "AGENTS.md: Harness Commands section"
+check_contains "AGENTS.md" "Execution Plans" "AGENTS.md: Execution Plans section"
+check_contains "docs/ARCHITECTURE.md" "Boundaries" "ARCHITECTURE.md: boundary guidance"
+check_contains "docs/OBSERVABILITY.md" "Required fields" "OBSERVABILITY.md: required fields"
+check_contains "Makefile.harness" "^smoke:" "Makefile.harness: smoke target"
+check_contains "Makefile.harness" "^test:" "Makefile.harness: test target"
+check_contains "Makefile.harness" "^lint:" "Makefile.harness: lint target"
+check_contains "Makefile.harness" "^typecheck:" "Makefile.harness: typecheck target"
+check_contains "Makefile.harness" "^ci:" "Makefile.harness: ci target"
+check_contains ".github/workflows/harness.yml" "make ci" "CI workflow executes make ci"
+
+echo
+if [ "$failures" -gt 0 ]; then
+  echo "Harness audit failed: $failures issue(s) detected."
+  exit 1
+fi
+
+echo "Harness audit passed."
diff --git a/.github/skills/agent-native/assets/templates/scripts/harness/entropy_check.sh b/.github/skills/agent-native/assets/templates/scripts/harness/entropy_check.sh
new file mode 100755
index 0000000..4869752
--- /dev/null
+++ b/.github/skills/agent-native/assets/templates/scripts/harness/entropy_check.sh
@@ -0,0 +1,54 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+root_dir=$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)
+cd "$root_dir"
+
+failures=0
+
+check_exists() {
+  local rel="$1"
+  if [ -e "$rel" ]; then
+    echo "[ok]      $rel"
+  else
+    echo "[missing] $rel"
+    failures=$((failures + 1))
+  fi
+}
+
+check_not_contains() {
+  local rel="$1"
+  local pattern="$2"
+  local label="$3"
+  if [ ! -f "$rel" ]; then
+    echo "[missing] $label (file missing: $rel)"
+    failures=$((failures + 1))
+    return
+  fi
+  if grep -En "$pattern" "$rel" >/dev/null 2>&1; then
+    echo "[drift]   $label"
+    failures=$((failures + 1))
+  else
+    echo "[ok]      $label"
+  fi
+}
+
+echo "Entropy check: $root_dir"
+echo
+
+check_exists "AGENTS.md"
+check_exists "PLANS.md"
+check_exists "docs/ARCHITECTURE.md"
+check_exists "docs/OBSERVABILITY.md"
+check_exists "Makefile.harness"
+
+check_not_contains "AGENTS.md" "<project-name>|<runtime>|<entrypoints>" "AGENTS.md placeholders removed"
+check_not_contains "docs/ARCHITECTURE.md" "^# Architecture$" "ARCHITECTURE customized"
+check_not_contains "docs/OBSERVABILITY.md" "^# Observability$" "OBSERVABILITY customized"
+
+echo
+if [ "$failures" -gt 0 ]; then
+  echo "Entropy check failed: $failures issue(s)."
+  exit 1
+fi
+echo "Entropy check passed."
diff --git a/.github/skills/agent-native/assets/templates/scripts/harness/format.sh b/.github/skills/agent-native/assets/templates/scripts/harness/format.sh
new file mode 100644
index 0000000..d96cf66
--- /dev/null
+++ b/.github/skills/agent-native/assets/templates/scripts/harness/format.sh
@@ -0,0 +1,57 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+# FORMAT — Auto-format source code.
+#
+# Auto-detects the project type and runs the appropriate formatter.
+# Customize or set HARNESS_FORMAT_CMD.
+
+root_dir=$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)
+cd "$root_dir"
+
+if [ -n "${HARNESS_FORMAT_CMD:-}" ]; then
+  eval "$HARNESS_FORMAT_CMD"
+  exit 0
+fi
+
+if [ -f "$root_dir/Cargo.toml" ] && command -v cargo >/dev/null 2>&1; then
+  cargo fmt
+  exit 0
+fi
+
+if [ -f "$root_dir/package.json" ] && command -v npx >/dev/null 2>&1; then
+  if npx prettier --version >/dev/null 2>&1; then
+    npx prettier --write .
+    exit 0
+  elif npx biome --version >/dev/null 2>&1; then
+    npx biome format --write .
+    exit 0
+  fi
+fi
+
+if [ -f "$root_dir/go.mod" ] && command -v gofmt >/dev/null 2>&1; then
+  gofmt -w .
+  exit 0
+fi
+
+if [ -f "$root_dir/pyproject.toml" ]; then
+  if command -v uv >/dev/null 2>&1 && uv run ruff --version >/dev/null 2>&1; then
+    uv run ruff format .
+    exit 0
+  elif command -v ruff >/dev/null 2>&1; then
+    ruff format .
+    exit 0
+  elif command -v black >/dev/null 2>&1; then
+    black .
+    exit 0
+  fi
+fi
+
+if ls "$root_dir"/*.csproj >/dev/null 2>&1 || ls "$root_dir"/*.sln >/dev/null 2>&1; then
+  dotnet format
+  exit 0
+fi
+
+echo "ERROR: No formatter detected."
+echo "Set HARNESS_FORMAT_CMD or customize scripts/harness/format.sh"
+exit 1
diff --git a/.github/skills/agent-native/assets/templates/scripts/harness/lint.sh b/.github/skills/agent-native/assets/templates/scripts/harness/lint.sh
new file mode 100755
index 0000000..42726b8
--- /dev/null
+++ b/.github/skills/agent-native/assets/templates/scripts/harness/lint.sh
@@ -0,0 +1,58 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+# LINT — Run static analysis / linting.
+#
+# This catches style violations, unused imports, and common bugs.
+# Must exit non-zero on any violation (treat lint failures as blocking).
+#
+# Customize or set HARNESS_LINT_CMD.
+
+root_dir=$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)
+cd "$root_dir"
+
+if [ -n "${HARNESS_LINT_CMD:-}" ]; then
+  eval "$HARNESS_LINT_CMD"
+  exit 0
+fi
+
+if [ -f "$root_dir/Cargo.toml" ] && command -v cargo >/dev/null 2>&1; then
+  cargo clippy --all-targets --all-features -- -D warnings
+  exit 0
+fi
+
+if [ -f "$root_dir/package.json" ] && command -v npm >/dev/null 2>&1; then
+  if node -e 'const p=require("./package.json"); process.exit(p.scripts&&p.scripts.lint?0:1)' 2>/dev/null; then
+    npm run -s lint
+    exit 0
+  fi
+fi
+
+if [ -f "$root_dir/go.mod" ] && command -v go >/dev/null 2>&1; then
+  go vet ./...
+  if command -v golangci-lint >/dev/null 2>&1; then
+    golangci-lint run
+  fi
+  exit 0
+fi
+
+if [ -f "$root_dir/pyproject.toml" ]; then
+  if command -v uv >/dev/null 2>&1 && uv run ruff --version >/dev/null 2>&1; then
+    uv run ruff check .
+    uv run ruff format --check .
+    exit 0
+  elif command -v ruff >/dev/null 2>&1; then
+    ruff check .
+    ruff format --check .
+    exit 0
+  fi
+fi
+
+if ls "$root_dir"/*.csproj >/dev/null 2>&1 || ls "$root_dir"/*.sln >/dev/null 2>&1; then
+  dotnet format --verify-no-changes
+  exit 0
+fi
+
+echo "ERROR: No linter detected."
+echo "Set HARNESS_LINT_CMD or customize scripts/harness/lint.sh"
+exit 1
diff --git a/.github/skills/agent-native/assets/templates/scripts/harness/setup.sh b/.github/skills/agent-native/assets/templates/scripts/harness/setup.sh
new file mode 100644
index 0000000..986467f
--- /dev/null
+++ b/.github/skills/agent-native/assets/templates/scripts/harness/setup.sh
@@ -0,0 +1,49 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+# SETUP — Install project dependencies.
+#
+# Auto-detects the project type and installs deps.
+# Customize or set HARNESS_SETUP_CMD.
+
+root_dir=$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)
+cd "$root_dir"
+
+if [ -n "${HARNESS_SETUP_CMD:-}" ]; then
+  eval "$HARNESS_SETUP_CMD"
+  exit 0
+fi
+
+if [ -f "$root_dir/Cargo.toml" ] && command -v cargo >/dev/null 2>&1; then
+  cargo fetch
+  exit 0
+fi
+
+if [ -f "$root_dir/package.json" ] && command -v npm >/dev/null 2>&1; then
+  npm install
+  exit 0
+fi
+
+if [ -f "$root_dir/go.mod" ] && command -v go >/dev/null 2>&1; then
+  go mod download
+  exit 0
+fi
+
+if [ -f "$root_dir/pyproject.toml" ]; then
+  if command -v uv >/dev/null 2>&1; then
+    uv sync
+    exit 0
+  elif command -v pip >/dev/null 2>&1; then
+    pip install -e .
+    exit 0
+  fi
+fi
+
+if ls "$root_dir"/*.csproj >/dev/null 2>&1 || ls "$root_dir"/*.sln >/dev/null 2>&1; then
+  dotnet restore
+  exit 0
+fi
+
+echo "ERROR: No project type detected."
+echo "Set HARNESS_SETUP_CMD or customize scripts/harness/setup.sh"
+exit 1
diff --git a/.github/skills/agent-native/assets/templates/scripts/harness/smoke.sh b/.github/skills/agent-native/assets/templates/scripts/harness/smoke.sh
new file mode 100755
index 0000000..7f3caed
--- /dev/null
+++ b/.github/skills/agent-native/assets/templates/scripts/harness/smoke.sh
@@ -0,0 +1,267 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+# SMOKE TEST — Verify the app starts, responds, and shuts down cleanly.
+#
+# Supports three modes via HARNESS_SMOKE_MODE, plus auto-detection:
+#   server   — start an HTTP server, poll health, hit a smoke endpoint (default for web apps)
+#   cli      — run the CLI tool with --help/--version and check exit code
+#   library  — run a basic import/require check
+#   auto     — auto-detect which mode to use
+#
+# See docs/OBSERVABILITY.md for the logging convention.
+
+root_dir=$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)
+cd "$root_dir"
+
+SMOKE_MODE="${HARNESS_SMOKE_MODE:-auto}"
+
+# ─── AUTO-DETECT MODE ──────────────────────────────────────────────
+detect_mode() {
+  # Check for server indicators
+  if [ -f "$root_dir/pyproject.toml" ]; then
+    if grep -qE "uvicorn|fastapi|flask|django|gunicorn|starlette" "$root_dir/pyproject.toml" 2>/dev/null; then
+      echo "server"; return
+    fi
+    if grep -qE "argparse|click|typer|fire" "$root_dir/pyproject.toml" 2>/dev/null; then
+      echo "cli"; return
+    fi
+  fi
+  if [ -f "$root_dir/package.json" ]; then
+    if grep -qE '"express"|"fastify"|"koa"|"hapi"|"next"|"nuxt"' "$root_dir/package.json" 2>/dev/null; then
+      echo "server"; return
+    fi
+    if grep -qE '"commander"|"yargs"|"meow"|"inquirer"' "$root_dir/package.json" 2>/dev/null; then
+      echo "cli"; return
+    fi
+    # Has a "start" script → likely server
+    if node -e 'const p=require("./package.json"); process.exit(p.scripts&&p.scripts.start?0:1)' 2>/dev/null; then
+      echo "server"; return
+    fi
+  fi
+  if [ -f "$root_dir/go.mod" ]; then
+    if grep -rqE '"net/http"|gin-gonic/gin|labstack/echo|gofiber/fiber|go-chi/chi' "$root_dir"/*.go "$root_dir"/cmd/ "$root_dir"/internal/ 2>/dev/null; then
+      echo "server"; return
+    fi
+    if grep -rqE "cobra|urfave/cli|kong" "$root_dir"/*.go "$root_dir"/cmd/ "$root_dir"/internal/ 2>/dev/null; then
+      echo "cli"; return
+    fi
+    # Has a main.go in root → likely CLI
+    if [ -f "$root_dir/main.go" ]; then
+      echo "cli"; return
+    fi
+  fi
+  if [ -f "$root_dir/Cargo.toml" ]; then
+    if grep -qE "actix|axum|rocket|warp|hyper" "$root_dir/Cargo.toml" 2>/dev/null; then
+      echo "server"; return
+    fi
+    if grep -qE "clap|structopt" "$root_dir/Cargo.toml" 2>/dev/null; then
+      echo "cli"; return
+    fi
+    # Has [[bin]] → likely CLI
+    if grep -q '\[\[bin\]\]' "$root_dir/Cargo.toml" 2>/dev/null; then
+      echo "cli"; return
+    fi
+  fi
+  if ls "$root_dir"/*.csproj >/dev/null 2>&1; then
+    if grep -qE "Microsoft\.AspNetCore|Kestrel" "$root_dir"/*.csproj 2>/dev/null; then
+      echo "server"; return
+    fi
+    echo "cli"; return
+  fi
+  # Default: library (no server or CLI indicators found)
+  echo "library"
+}
+
+if [ "$SMOKE_MODE" = "auto" ]; then
+  SMOKE_MODE=$(detect_mode)
+  echo "Auto-detected smoke mode: $SMOKE_MODE"
+fi
+
+# ─── LIBRARY MODE ──────────────────────────────────────────────────
+smoke_library() {
+  echo "Running library import check..."
+  if [ -f "$root_dir/pyproject.toml" ]; then
+    # Extract package name from pyproject.toml; "|| true" avoids a pipefail abort if none is found
+    local pkg
+    pkg=$(grep -E '^[[:space:]]*name[[:space:]]*=' "$root_dir/pyproject.toml" | head -1 | sed 's/.*=[[:space:]]*["'"'"']\([^"'"'"']*\)["'"'"'].*/\1/' | tr '-' '_' || true)
+    if [ -n "$pkg" ]; then
+      if command -v uv >/dev/null 2>&1; then
+        uv run python -c "import $pkg; print('Import OK:', $pkg.__name__ if hasattr($pkg, '__name__') else '$pkg')"
+      else
+        python -c "import $pkg; print('Import OK:', $pkg.__name__ if hasattr($pkg, '__name__') else '$pkg')"
+      fi
+      echo "Library smoke test passed ✅"
+      return 0
+    fi
+  fi
+  if [ -f "$root_dir/package.json" ]; then
+    local pkg
+    pkg=$(node -e 'console.log(require("./package.json").name || "")' 2>/dev/null)
+    if [ -n "$pkg" ]; then
+      node -e "require('$pkg'); console.log('Import OK: $pkg')" 2>/dev/null || \
+        node -e "require('.'); console.log('Import OK: $pkg')"
+      echo "Library smoke test passed ✅"
+      return 0
+    fi
+  fi
+  if [ -f "$root_dir/Cargo.toml" ] && command -v cargo >/dev/null 2>&1; then
+    cargo build --quiet
+    echo "Library smoke test passed ✅ (cargo build succeeded)"
+    return 0
+  fi
+  if [ -f "$root_dir/go.mod" ] && command -v go >/dev/null 2>&1; then
+    go build ./...
+    echo "Library smoke test passed ✅ (go build succeeded)"
+    return 0
+  fi
+  echo "ERROR: Cannot determine library import check."
+  echo "Set HARNESS_SMOKE_MODE=server or HARNESS_SMOKE_MODE=cli, or customize scripts/harness/smoke.sh"
+  exit 1
+}
+
+# ─── CLI MODE ──────────────────────────────────────────────────────
+smoke_cli() {
+  local cli_cmd="${HARNESS_SMOKE_CLI_CMD:-}"
+
+  if [ -z "$cli_cmd" ]; then
+    # Auto-detect CLI command
+    if [ -f "$root_dir/pyproject.toml" ]; then
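+      # Look for [project.scripts] entries; "|| true" keeps pipefail from aborting when there are none
+      local entry
+      entry=$(grep -A5 '\[project\.scripts\]' "$root_dir/pyproject.toml" 2>/dev/null | grep -E '^[[:space:]]*[A-Za-z0-9_.-]+[[:space:]]*=' | head -1 | sed 's/[[:space:]]*=.*//' | tr -d ' ' || true)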
+      if [ -n "$entry" ]; then
+        if command -v uv >/dev/null 2>&1; then
+          cli_cmd="uv run $entry"
+        else
+          cli_cmd="$entry"
+        fi
+      fi
+    fi
+    if [ -z "$cli_cmd" ] && [ -f "$root_dir/package.json" ]; then
+      local bin
+      bin=$(node -e 'const p=require("./package.json"); const b=p.bin; if(typeof b==="string"){console.log(p.name)}else if(b){console.log(Object.keys(b)[0])}' 2>/dev/null)
+      if [ -n "$bin" ]; then
+        cli_cmd="npx $bin"
+      fi
+    fi
+    if [ -z "$cli_cmd" ] && [ -f "$root_dir/go.mod" ]; then
+      cli_cmd="go run ."
+    fi
+    if [ -z "$cli_cmd" ] && [ -f "$root_dir/Cargo.toml" ]; then
+      cli_cmd="cargo run --quiet --"
+    fi
+  fi
+
+  if [ -z "$cli_cmd" ]; then
+    echo "ERROR: Cannot detect CLI command."
+    echo "Set HARNESS_SMOKE_CLI_CMD or customize scripts/harness/smoke.sh"
+    exit 1
+  fi
+
+  echo "Running CLI smoke: $cli_cmd --help"
+  if $cli_cmd --help >/dev/null 2>&1; then
+    echo "CLI smoke test passed ✅ (--help exited 0)"
+    return 0
+  fi
+
+  echo "Trying: $cli_cmd --version"
+  if $cli_cmd --version >/dev/null 2>&1; then
+    echo "CLI smoke test passed ✅ (--version exited 0)"
+    return 0
+  fi
+
+  echo "ERROR: CLI smoke failed ($cli_cmd --help and --version both failed)"
+  exit 1
+}
+
+# ─── SERVER MODE ───────────────────────────────────────────────────
+smoke_server() {
+  # ─── CUSTOMIZE THESE ───────────────────────────────────────────────
+  START_CMD="${HARNESS_SMOKE_START_CMD:-}"
+  PORT="${HARNESS_SMOKE_PORT:-8080}"
+  HEALTH_URL="${HARNESS_SMOKE_HEALTH_URL:-http://localhost:$PORT/health}"
+  SMOKE_URL="${HARNESS_SMOKE_URL:-http://localhost:$PORT/}"
+  READY_TIMEOUT=15
+
+  # ─── AUTO-DETECT START COMMAND (if not set) ────────────────────────
+  if [ -z "$START_CMD" ]; then
+    if [ -f "$root_dir/pyproject.toml" ]; then
+      if grep -q "uvicorn" "$root_dir/pyproject.toml" 2>/dev/null; then
+        START_CMD="uv run uvicorn app.main:app --host 0.0.0.0 --port $PORT"
+      elif grep -q "fastapi" "$root_dir/pyproject.toml" 2>/dev/null; then
+        START_CMD="uv run fastapi run --port $PORT"
+      fi
+    elif [ -f "$root_dir/package.json" ]; then
+      if node -e 'const p=require("./package.json"); process.exit(p.scripts&&p.scripts.start?0:1)' 2>/dev/null; then
+        START_CMD="npm run -s start"
+      fi
+    elif [ -f "$root_dir/go.mod" ]; then
+      START_CMD="go run ."
+    elif ls "$root_dir"/*.csproj >/dev/null 2>&1; then
+      START_CMD="dotnet run --urls http://0.0.0.0:$PORT"
+    fi
+  fi
+
+  if [ -z "$START_CMD" ]; then
+    echo "ERROR: Cannot detect start command."
+    echo "Set HARNESS_SMOKE_START_CMD or customize scripts/harness/smoke.sh"
+    exit 1
+  fi
+
+  # ─── CLEANUP TRAP ──────────────────────────────────────────────────
+  SERVER_PID=""
+  cleanup() {
+    if [ -n "$SERVER_PID" ] && kill -0 "$SERVER_PID" 2>/dev/null; then
+      kill "$SERVER_PID" 2>/dev/null || true
+      wait "$SERVER_PID" 2>/dev/null || true
+    fi
+  }
+  trap cleanup EXIT
+
+  # ─── START SERVER ──────────────────────────────────────────────────
+  echo "Starting server: $START_CMD"
+  $START_CMD &
+  SERVER_PID=$!
+
+  # ─── WAIT FOR READY ───────────────────────────────────────────────
+  echo "Waiting for server on $HEALTH_URL (timeout: ${READY_TIMEOUT}s)..."
+  elapsed=0
+  while [ "$elapsed" -lt "$READY_TIMEOUT" ]; do
+    if curl -sf "$HEALTH_URL" >/dev/null 2>&1; then
+      echo "Server ready after ${elapsed}s"
+      break
+    fi
+    sleep 1
+    elapsed=$((elapsed + 1))
+  done
+
+  if [ "$elapsed" -ge "$READY_TIMEOUT" ]; then
+    echo "ERROR: Server did not become ready within ${READY_TIMEOUT}s"
+    exit 1
+  fi
+
+  # ─── SMOKE REQUESTS ───────────────────────────────────────────────
+  echo "Hitting smoke endpoint: $SMOKE_URL"
+  HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" "$SMOKE_URL" 2>/dev/null || true)
+  if [ -z "$HTTP_CODE" ] || [ "$HTTP_CODE" = "000" ]; then
+    echo "ERROR: Smoke request failed (no response)"
+    exit 1
+  fi
+  echo "Smoke response: HTTP $HTTP_CODE"
+
+  # ─── DONE ─────────────────────────────────────────────────────────
+  echo "Smoke test passed ✅"
+}
+
+# ─── DISPATCH ──────────────────────────────────────────────────────
+case "$SMOKE_MODE" in
+  server)  smoke_server ;;
+  cli)     smoke_cli ;;
+  library) smoke_library ;;
+  *)
+    echo "ERROR: Unknown HARNESS_SMOKE_MODE: $SMOKE_MODE"
+    echo "Valid values: server, cli, library, auto"
+    exit 1
+    ;;
+esac
diff --git a/.github/skills/agent-native/assets/templates/scripts/harness/test.sh b/.github/skills/agent-native/assets/templates/scripts/harness/test.sh
new file mode 100755
index 0000000..6d5a467
--- /dev/null
+++ b/.github/skills/agent-native/assets/templates/scripts/harness/test.sh
@@ -0,0 +1,55 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+# TEST — Run the project's test suite.
+#
+# Requirements:
+#   - At least 5 meaningful tests covering core functionality (CRUD, error cases)
+#   - Tests should verify real behavior, not just "assert True"
+#   - Tests must be deterministic (no flaky network calls, no shared state leaks)
+#
+# Customize the detection below or set HARNESS_TEST_CMD.
+
+root_dir=$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)
+cd "$root_dir"
+
+if [ -n "${HARNESS_TEST_CMD:-}" ]; then
+  eval "$HARNESS_TEST_CMD"
+  exit 0
+fi
+
+if [ -f "$root_dir/Cargo.toml" ] && command -v cargo >/dev/null 2>&1; then
+  cargo test --quiet
+  exit 0
+fi
+
+if [ -f "$root_dir/package.json" ] && command -v npm >/dev/null 2>&1; then
+  if node -e 'const p=require("./package.json"); process.exit(p.scripts&&p.scripts.test?0:1)' 2>/dev/null; then
+    npm run -s test
+    exit 0
+  fi
+fi
+
+if [ -f "$root_dir/go.mod" ] && command -v go >/dev/null 2>&1; then
+  go test ./...
+  exit 0
+fi
+
+if [ -f "$root_dir/pyproject.toml" ]; then
+  if command -v uv >/dev/null 2>&1; then
+    uv run pytest -q
+    exit 0
+  elif command -v pytest >/dev/null 2>&1; then
+    pytest -q
+    exit 0
+  fi
+fi
+
+if ls "$root_dir"/*.csproj >/dev/null 2>&1 || ls "$root_dir"/*.sln >/dev/null 2>&1; then
+  dotnet test --verbosity quiet
+  exit 0
+fi
+
+echo "ERROR: No test runner detected."
+echo "Set HARNESS_TEST_CMD or customize scripts/harness/test.sh"
+exit 1
diff --git a/.github/skills/agent-native/assets/templates/scripts/harness/typecheck.sh b/.github/skills/agent-native/assets/templates/scripts/harness/typecheck.sh
new file mode 100755
index 0000000..e00ecc2
--- /dev/null
+++ b/.github/skills/agent-native/assets/templates/scripts/harness/typecheck.sh
@@ -0,0 +1,61 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+# TYPECHECK — Run the type checker / compiler verification.
+#
+# Must exit non-zero on type errors (treat type failures as blocking).
+# For compiled languages (Go, C#, Rust), this is the build or check step (go vet, dotnet build, cargo check).
+#
+# Customize or set HARNESS_TYPECHECK_CMD.
+
+root_dir=$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)
+cd "$root_dir"
+
+if [ -n "${HARNESS_TYPECHECK_CMD:-}" ]; then
+  eval "$HARNESS_TYPECHECK_CMD"
+  exit 0
+fi
+
+if [ -f "$root_dir/Cargo.toml" ] && command -v cargo >/dev/null 2>&1; then
+  cargo check --quiet
+  exit 0
+fi
+
+if [ -f "$root_dir/package.json" ] && command -v npm >/dev/null 2>&1; then
+  if node -e 'const p=require("./package.json"); process.exit(p.scripts&&p.scripts.typecheck?0:1)' 2>/dev/null; then
+    npm run -s typecheck
+    exit 0
+  fi
+  # Fallback: direct tsc
+  if command -v npx >/dev/null 2>&1 && [ -f "$root_dir/tsconfig.json" ]; then
+    npx tsc --noEmit
+    exit 0
+  fi
+fi
+
+if [ -f "$root_dir/go.mod" ] && command -v go >/dev/null 2>&1; then
+  go vet ./...
+  exit 0
+fi
+
+if [ -f "$root_dir/pyproject.toml" ]; then
+  if command -v uv >/dev/null 2>&1 && uv run mypy --version >/dev/null 2>&1; then
+    uv run mypy .
+    exit 0
+  elif command -v mypy >/dev/null 2>&1; then
+    mypy .
+    exit 0
+  elif command -v pyright >/dev/null 2>&1; then
+    pyright
+    exit 0
+  fi
+fi
+
+if ls "$root_dir"/*.csproj >/dev/null 2>&1 || ls "$root_dir"/*.sln >/dev/null 2>&1; then
+  dotnet build --nologo --verbosity quiet
+  exit 0
+fi
+
+echo "ERROR: No type checker detected."
+echo "Set HARNESS_TYPECHECK_CMD or customize scripts/harness/typecheck.sh"
+exit 1
diff --git a/.github/skills/agent-native/references/agent-hooks/README.md b/.github/skills/agent-native/references/agent-hooks/README.md
new file mode 100644
index 0000000..f0dd862
--- /dev/null
+++ b/.github/skills/agent-native/references/agent-hooks/README.md
@@ -0,0 +1,55 @@
+# Agent Hooks Reference
+
+Agent hooks are lifecycle callbacks that fire at key points during an agent's session — before/after tool use, on session start/stop, etc. They provide **deterministic, code-driven automation** that runs regardless of how the agent is prompted.
+
+Hooks are the harness engineering mechanism for enforcing policy, automating quality checks, and creating audit trails without relying on the agent to "remember" to do things.
+
+## Why Hooks Matter for Harness Engineering
+
+| Practice | How Hooks Help |
+|---|---|
+| Make the hard thing easy to do | PostToolUse auto-runs formatters/linters after edits |
+| Build observability in from day 1 | Log every tool invocation to JSONL for audit |
+| Invest in static analysis | PostToolUse triggers lint after file changes |
+| Manage entropy | Session hooks validate project state on start |
+| Bring your own harness | Hooks enforce repo-specific policies deterministically |
+
+## Hook Implementations by Agent
+
+Each agent has its own hook configuration format. See the subfolder for your agent:
+
+- **[copilot-hooks/](./copilot-hooks/)** — GitHub Copilot (VS Code, CLI, coding agent)
+- _(future: claude-code/, cursor/, openclaw/, etc.)_
+
+## Common Patterns (Agent-Agnostic)
+
+These patterns apply regardless of which agent you're using. Implement them in your agent's hook format.
+
+### 1. Auto-Format After File Edits
+
+Run your formatter (Prettier, Ruff, gofmt, etc.) after every file edit. The agent sees the formatted result, not its raw output.
+
+### 2. Auto-Lint After File Edits
+
+Run lint checks after edits. If the agent introduced a violation, the error appears in context immediately — no waiting for CI.
+
+### 3. Block Dangerous Commands
+
+Deny destructive operations (`rm -rf /`, `DROP TABLE`, `git push --force`) at the pre-tool-use stage. The agent gets a clear denial reason and can retry safely.
+
+### 4. Audit Trail
+
+Log every tool invocation (tool name, args, result, timestamp) to `.harness/agent-audit.jsonl`. Useful for debugging agent behavior and measuring tool usage patterns.
+
+### 5. Session Initialization
+
+On session start, validate project state: check dependencies are installed, test DB is seeded, required env vars exist. Inject context about project state into the agent.
+
+### 6. Observability Integration
+
+After tool use, append structured events to `.harness/logs.jsonl`:
+```json
+{"ts":"...","level":"INFO","msg":"tool_use","tool":"bash","args":"npm test","result":"success","duration_ms":3200}
+```
+
+This bridges agent activity into the same JSONL observability pipeline the app uses.
diff --git a/.github/skills/agent-native/references/agent-hooks/copilot-hooks/README.md b/.github/skills/agent-native/references/agent-hooks/copilot-hooks/README.md
new file mode 100644
index 0000000..538010f
--- /dev/null
+++ b/.github/skills/agent-native/references/agent-hooks/copilot-hooks/README.md
@@ -0,0 +1,393 @@
+# GitHub Copilot Hooks
+
+Hooks for GitHub Copilot agents — VS Code agent mode, Copilot CLI, and the coding agent (cloud). All three share the same hook configuration format.
+
+**Sources:**
+- [VS Code hooks docs](https://code.visualstudio.com/docs/copilot/customization/hooks)
+- [GitHub hooks reference](https://docs.github.com/en/copilot/reference/hooks-configuration)
+- [Copilot CLI hooks tutorial](https://docs.github.com/en/copilot/tutorials/copilot-cli-hooks)
+
+---
+
+## Configuration
+
+Hooks are JSON files with a `hooks` object keyed by event type. VS Code and Copilot CLI share the same format (also compatible with Claude Code's `settings.json`).
+
+### File Locations (searched in order)
+
+| Location | Scope | Committed? |
+|---|---|---|
+| `.github/hooks/*.json` | Project (shared with team) | ✅ Yes |
+| `.claude/settings.json` | Project (compatible format) | ✅ Yes |
+| `.claude/settings.local.json` | Project (personal) | ❌ No |
+| `~/.claude/settings.json` | User (all projects) | ❌ No |
+
+Workspace hooks take precedence over user hooks for the same event type.
+
+### Format
+
+```json
+{
+  "hooks": {
+    "PreToolUse": [
+      {
+        "type": "command",
+        "command": "./scripts/pre-tool.sh",
+        "timeout": 15
+      }
+    ],
+    "PostToolUse": [
+      {
+        "type": "command",
+        "command": "./scripts/post-tool.sh"
+      }
+    ]
+  }
+}
+```
+
+### Command Properties
+
+| Property | Type | Description |
+|---|---|---|
+| `type` | string | Must be `"command"` |
+| `command` | string | Default command (cross-platform) |
+| `bash` | string | Bash-specific command (CLI) |
+| `windows` | string | Windows override |
+| `linux` | string | Linux override |
+| `osx` | string | macOS override |
+| `cwd` | string | Working directory (relative to repo root) |
+| `env` | object | Additional environment variables |
+| `timeout` / `timeoutSec` | number | Timeout in seconds (default: 30) |
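+
+As a sketch of how these properties combine, the entry below sets a default command, a Windows override, a working directory, an extra environment variable, and a timeout. The script names and the `HARNESS_ENV` variable are placeholders rather than files shipped with this skill, and the Windows override assumes PowerShell is available on the host.
+
+```json
+{
+  "hooks": {
+    "PreToolUse": [
+      {
+        "type": "command",
+        "command": "./pre-tool.sh",
+        "windows": "powershell -File .\\pre-tool.ps1",
+        "cwd": "scripts/hooks",
+        "env": { "HARNESS_ENV": "local" },
+        "timeout": 10
+      }
+    ]
+  }
+}
+```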
+
+---
+
+## Hook Events
+
+### SessionStart
+**When:** User submits first prompt of a new session (or resumes one).
+
+**Input:**
+```json
+{
+  "timestamp": 1704614400000,
+  "cwd": "/path/to/project",
+  "source": "new",
+  "initialPrompt": "Fix the auth bug"
+}
+```
+- `source`: `"new"`, `"resume"`, or `"startup"`
+
+**Output:** Ignored.
+
+**Harness use:** Validate project state, initialize resources, log session start.
+
+### UserPromptSubmit
+**When:** User submits a prompt.
+
+**Input:**
+```json
+{
+  "timestamp": 1704614500000,
+  "cwd": "/path/to/project",
+  "prompt": "Fix the authentication bug"
+}
+```
+
+**Output:** Ignored.
+
+**Harness use:** Audit trail of user prompts.
+
+### PreToolUse ⚡ (most powerful)
+**When:** Before the agent invokes any tool (bash, edit, view, create, etc.).
+
+**Input:**
+```json
+{
+  "timestamp": 1704614600000,
+  "cwd": "/path/to/project",
+  "toolName": "bash",
+  "toolArgs": "{\"command\":\"rm -rf dist\"}"
+}
+```
+- VS Code also includes `tool_input` (parsed object) and `tool_use_id`
+
+**Output (optional):**
+```json
+{
+  "permissionDecision": "deny",
+  "permissionDecisionReason": "Destructive command blocked by policy"
+}
+```
+
+| `permissionDecision` | Effect |
+|---|---|
+| `"allow"` | Auto-approve (skip user confirmation) |
+| `"deny"` | Block execution, show reason to agent |
+| `"ask"` | Require user confirmation (VS Code only) |
+
+When multiple hooks run, **most restrictive wins** (deny > ask > allow).
+
+VS Code also supports `updatedInput` (modify tool args) and `additionalContext` (inject context for the model).
+
+**Harness use:** Block dangerous commands, enforce file path restrictions, require approval for sensitive ops.
+
+### PostToolUse
+**When:** After a tool completes (success or failure).
+
+**Input:**
+```json
+{
+  "timestamp": 1704614700000,
+  "cwd": "/path/to/project",
+  "toolName": "bash",
+  "toolArgs": "{\"command\":\"npm test\"}",
+  "toolResult": {
+    "resultType": "success",
+    "textResultForLlm": "All tests passed (15/15)"
+  }
+}
+```
+- `resultType`: `"success"`, `"failure"`, or `"denied"`
+
+**Output:** Ignored.
+
+**Harness use:** Auto-format, auto-lint, log tool results, trigger follow-up checks.
+
+### PreCompact (VS Code only)
+**When:** Before conversation context is compacted (truncated for length).
+
+**Harness use:** Export important context before it's lost.
+
+### SubagentStart / SubagentStop (VS Code only)
+**When:** Subagent spawned / completed.
+
+**Harness use:** Track nested agent usage, aggregate results.
+
+### Stop
+**When:** Agent session ends.
+
+**Input:**
+```json
+{
+  "timestamp": 1704618000000,
+  "cwd": "/path/to/project",
+  "reason": "complete"
+}
+```
+- `reason`: `"complete"`, `"error"`, `"abort"`, `"timeout"`, `"user_exit"`
+
+**Output:** Ignored.
+
+**Harness use:** Generate reports, cleanup temp files, finalize audit log.
+
+### ErrorOccurred (CLI only)
+**When:** Error during agent execution.
+
+**Harness use:** Log errors, alert on failures.
+
+---
+
+## Exit Codes
+
+| Code | Behavior |
+|---|---|
+| 0 | Success — parse stdout as JSON |
+| 2 | Blocking error — stop processing, show error to model |
+| Other | Non-blocking warning — show warning, continue |
+
+---
+
+## Harness Engineering Hook Recipes
+
+### Recipe 1: Auto-Format + Lint After Edits
+
+`.github/hooks/post-tool.json`:
+```json
+{
+  "hooks": {
+    "PostToolUse": [
+      {
+        "type": "command",
+        "command": "./scripts/hooks/post-edit.sh",
+        "timeout": 30
+      }
+    ]
+  }
+}
+```
+
+`scripts/hooks/post-edit.sh`:
+```bash
+#!/bin/bash
+INPUT=$(cat)
+TOOL_NAME=$(echo "$INPUT" | jq -r '.toolName')
+
+# Only run after file edits
+if [[ "$TOOL_NAME" == "edit" || "$TOOL_NAME" == "create" || "$TOOL_NAME" == "editFiles" ]]; then
+  # toolArgs is a JSON-encoded string: decode it once, then pick the edited path
+  FILE=$(echo "$INPUT" | jq -r '.toolArgs | fromjson | .path // .files[0] // empty' 2>/dev/null)
+  [ -n "$FILE" ] || exit 0
+  # Format, then quick lint (non-blocking: failures are ignored)
+  npx prettier --write "$FILE" 2>/dev/null || true
+  npx eslint --format json "$FILE" \
+    >> .harness/lint-results.jsonl 2>/dev/null || true
+fi
+```
+
+### Recipe 2: Block Dangerous Commands
+
+`.github/hooks/safety.json`:
+```json
+{
+  "hooks": {
+    "PreToolUse": [
+      {
+        "type": "command",
+        "command": "./scripts/hooks/safety-check.sh",
+        "timeout": 5
+      }
+    ]
+  }
+}
+```
+
+`scripts/hooks/safety-check.sh`:
+```bash
+#!/bin/bash
+INPUT=$(cat)
+TOOL_NAME=$(echo "$INPUT" | jq -r '.toolName')
+TOOL_ARGS=$(echo "$INPUT" | jq -r '.toolArgs')
+
+# Block dangerous bash commands
+if [ "$TOOL_NAME" = "bash" ]; then
+  CMD=$(echo "$TOOL_ARGS" | jq -r '.command')
+  if echo "$CMD" | grep -qE 'rm -rf /|DROP TABLE|git push.*--force' || echo "$CMD" | grep -qF ':(){ :|:& };:'; then
+    echo '{"permissionDecision":"deny","permissionDecisionReason":"Blocked by safety policy: destructive command detected"}'
+    exit 0
+  fi
+fi
+
+# Block edits outside src/ and test/
+if [[ "$TOOL_NAME" == "edit" || "$TOOL_NAME" == "create" ]]; then
+  FILE_PATH=$(echo "$TOOL_ARGS" | jq -r '.path')
+  if [[ ! "$FILE_PATH" =~ ^(src/|test/|tests/) ]]; then
+    echo '{"permissionDecision":"deny","permissionDecisionReason":"Can only edit files in src/ or test/ directories"}'
+    exit 0
+  fi
+fi
+
+# Allow everything else
+echo '{"permissionDecision":"allow"}'
+```
+
+### Recipe 3: Audit Trail to JSONL
+
+`.github/hooks/audit.json`:
+```json
+{
+  "hooks": {
+    "PreToolUse": [
+      {
+        "type": "command",
+        "command": "./scripts/hooks/audit-log.sh"
+      }
+    ],
+    "PostToolUse": [
+      {
+        "type": "command",
+        "command": "./scripts/hooks/audit-result.sh"
+      }
+    ]
+  }
+}
+```
+
+`scripts/hooks/audit-log.sh`:
+```bash
+#!/bin/bash
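+INPUT=$(cat)
+mkdir -p .harness
+
+# Build the record with jq so every value is escaped and the line stays valid JSON.
+# toolArgs arrives as a JSON-encoded string: decode it when possible, otherwise log it as-is.
+echo "$INPUT" | jq -c '{ts: .timestamp, event: "pre_tool", tool: .toolName, args: (.toolArgs as $a | try ($a | fromjson) catch $a)}' >> .harness/agent-audit.jsonl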
+```
+
+`scripts/hooks/audit-result.sh`:
+```bash
+#!/bin/bash
+INPUT=$(cat)
+mkdir -p .harness
+
+# Build the record with jq so every value is escaped and the line stays valid JSON.
+echo "$INPUT" | jq -c '{ts: .timestamp, event: "post_tool", tool: .toolName, result: .toolResult.resultType}' \
+  >> .harness/agent-audit.jsonl
+```
+
+### Recipe 4: Session Init — Validate Project State
+
+`.github/hooks/session.json`:
+```json
+{
+  "hooks": {
+    "SessionStart": [
+      {
+        "type": "command",
+        "command": "./scripts/hooks/session-init.sh",
+        "timeout": 60
+      }
+    ]
+  }
+}
+```
+
+`scripts/hooks/session-init.sh`:
+```bash
+#!/bin/bash
+# Ensure dependencies are installed
+if [ -f "package.json" ] && [ ! -d "node_modules" ]; then
+  npm install --silent
+fi
+
+# Ensure .harness/ exists
+mkdir -p .harness/queries
+touch .harness/logs.jsonl .harness/traces.jsonl .harness/metrics.jsonl
+
+# Clear stale telemetry from previous sessions
+: > .harness/logs.jsonl
+: > .harness/traces.jsonl
+
+# Seed test data if needed
+if [ -f "scripts/harness/seed.sh" ]; then
+  ./scripts/harness/seed.sh
+fi
+
+printf '{"ts":%s000,"event":"session_init"}\n' "$(date +%s)" >> .harness/agent-audit.jsonl
+```
+
+---
+
+## Copilot-Specific Customization Files
+
+Hooks work alongside these other Copilot customization points:
+
+| File | Purpose | Auto-loaded? |
+|---|---|---|
+| `.github/copilot-instructions.md` | Always-on coding standards | ✅ Every request |
+| `AGENTS.md` | Agent instructions (multi-agent compat) | ✅ Every request |
+| `.github/skills/*/SKILL.md` | Specialized capabilities | On-demand (task match) |
+| `*.instructions.md` | File-pattern or task-based rules | On-demand (glob/description match) |
+| `.agent.md` | Custom agent personas | When selected |
+| `.github/hooks/*.json` | Lifecycle hooks (this doc) | At lifecycle events |
+
+**Skills** are an [open standard](https://agentskills.io) that work across VS Code, Copilot CLI, and the coding agent. The harness engineering skill itself can be installed in `.github/skills/` or `.agents/skills/`.
+
+---
+
+## Tips
+
+- **Keep hooks fast.** Hooks run synchronously — a slow hook blocks the agent. Use the `timeout` property.
+- **Exit 0 for success.** Exit 2 to hard-block. Any other exit code is a non-blocking warning.
+- **Hooks are deterministic.** Unlike instructions (which are suggestions), hooks execute your code. Use them for things that *must* happen.
+- **Combine hooks + instructions.** Use hooks to enforce policy; use instructions to guide behavior. They're complementary.
+- **Test hooks manually.** Pipe test JSON into your script: `echo '{"toolName":"bash","toolArgs":"{\"command\":\"rm -rf /\"}"}' | ./scripts/hooks/safety-check.sh`
diff --git a/.github/skills/agent-native/references/agent-hooks/pre-tool-hooks/README.md b/.github/skills/agent-native/references/agent-hooks/pre-tool-hooks/README.md
new file mode 100644
index 0000000..795bab5
--- /dev/null
+++ b/.github/skills/agent-native/references/agent-hooks/pre-tool-hooks/README.md
@@ -0,0 +1,102 @@
+# Pre-Tool Hooks for Copilot CLI
+
+Extracted from [plankton](https://github.com/alexfazio/plankton) — these are the hooks that **actually work** with GitHub Copilot CLI today.
+
+## Why only PreToolUse?
+
+Copilot CLI hooks have a critical limitation: **only `preToolUse` output is processed** by the model. Specifically, only `permissionDecision: "deny"` is honored.
+
+| Hook Event | Copilot CLI Behavior |
+|---|---|
+| `preToolUse` | ✅ `deny` blocks tool execution, reason shown to model |
+| `postToolUse` | ❌ Output ignored — fire-and-forget only |
+| `sessionStart/End` | ❌ Output ignored |
+| `errorOccurred` | ❌ Output ignored |
+
+PostToolUse hooks (linting, feedback) can still *log* to files, but the LLM never sees the results. That makes lint-detect-fix pipelines non-functional on Copilot CLI. PreToolUse guardrails are the real value.
+
+## What's Included
+
+| File | Lines | Purpose |
+|---|---|---|
+| `platform_shim.sh` | ~90 | Detects Claude Code vs Copilot CLI, normalizes tool names and JSON formats |
+| `protect_linter_configs.sh` | ~90 | Blocks edits to linter config files (`.ruff.toml`, `biome.json`, etc.) |
+| `enforce_package_managers.sh` | ~500 | Blocks legacy package managers (`pip` → `uv`, `npm` → `bun`, etc.) |
+| `config.json` | — | Protected files list + package manager enforcement config |
+| `copilot-hooks.json` | — | Drop-in `.github/hooks/` config for Copilot CLI |
+
+## Setup
+
+### For Copilot CLI (`.github/hooks/`)
+
+```bash
+# From your repo root
+mkdir -p .github/hooks
+cp platform_shim.sh protect_linter_configs.sh enforce_package_managers.sh .github/hooks/
+cp config.json .github/hooks/
+cp copilot-hooks.json .github/hooks/plankton.json
+chmod +x .github/hooks/*.sh
+```
+
+### For Claude Code (`.claude/settings.json`)
+
+```json
+{
+  "hooks": {
+    "PreToolUse": [
+      {
+        "matcher": "Edit|Write",
+        "hooks": [{ "type": "command", "command": ".claude/hooks/protect_linter_configs.sh" }]
+      },
+      {
+        "matcher": "Bash",
+        "hooks": [{ "type": "command", "command": ".claude/hooks/enforce_package_managers.sh" }]
+      }
+    ]
+  }
+}
+```
+
+Note: Claude Code supports matchers (hooks only fire for matching tools). Copilot CLI does not — the scripts filter internally via the platform shim.
+
+## Configuration
+
+Edit `config.json` to customize:
+
+### `protected_files`
+Array of filenames that agents cannot modify. Default includes common linter configs.
+
+### `package_managers`
+Per-language enforcement. Values: `"uv"`, `"bun"`, `"uv:warn"` (warn instead of block), or `false` (disabled).
+
+```json
+{
+  "package_managers": {
+    "python": "uv",
+    "javascript": "bun"
+  }
+}
+```
+
+### `allowed_subcommands`
+Subcommands that bypass enforcement (e.g., `npm audit`, `pip download`). In `config.json` this key is nested inside `package_managers`, keyed by the legacy tool name.
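+
+For illustration, a configuration that only warns about legacy Python tooling, disables JavaScript enforcement, and exempts `pip download` might look like the sketch below. It follows the layout of the shipped `config.json`, where `allowed_subcommands` sits inside `package_managers`; the particular values are examples, not recommended defaults.
+
+```json
+{
+  "package_managers": {
+    "python": "uv:warn",
+    "javascript": false,
+    "allowed_subcommands": {
+      "pip": ["download"]
+    }
+  }
+}
+```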
+
+## Dependencies
+
+- `jaq` or `jq` — JSON parsing (scripts prefer `jaq`, fall back to `jq`)
+- `bash` 4+
+
+## Platform Shim
+
+The shim auto-detects which agent platform is running based on the input JSON shape:
+
+| | Claude Code | Copilot CLI |
+|---|---|---|
+| Tool name field | `tool_name` | `toolName` |
+| Tool input field | `tool_input` (object) | `toolArgs` (JSON string) |
+| Deny output | `{"decision": "block", "reason": "..."}` | `{"permissionDecision": "deny", "permissionDecisionReason": "..."}` |
+| Tool name casing | `Edit`, `Write`, `Bash` | `edit`, `create`, `bash` (the shim normalizes these) |
+
+## Credit
+
+Based on [plankton](https://github.com/alexfazio/plankton) by Alex Fazio. Platform shim and Copilot CLI support by [@bertclaws](https://github.com/bertclaws).
diff --git a/.github/skills/agent-native/references/agent-hooks/pre-tool-hooks/config.json b/.github/skills/agent-native/references/agent-hooks/pre-tool-hooks/config.json
new file mode 100644
index 0000000..4586a40
--- /dev/null
+++ b/.github/skills/agent-native/references/agent-hooks/pre-tool-hooks/config.json
@@ -0,0 +1,45 @@
+{
+  "protected_files": [
+    ".markdownlint.jsonc",
+    ".markdownlint-cli2.jsonc",
+    ".shellcheckrc",
+    ".yamllint",
+    ".hadolint.yaml",
+    ".jscpd.json",
+    ".flake8",
+    "taplo.toml",
+    ".ruff.toml",
+    "ty.toml",
+    "biome.json",
+    ".oxlintrc.json",
+    ".semgrep.yml",
+    "knip.json"
+  ],
+  "package_managers": {
+    "python": "uv",
+    "javascript": "bun",
+    "allowed_subcommands": {
+      "npm": [
+        "audit",
+        "view",
+        "pack",
+        "publish",
+        "whoami",
+        "login"
+      ],
+      "pip": [
+        "download"
+      ],
+      "yarn": [
+        "audit",
+        "info"
+      ],
+      "pnpm": [
+        "audit",
+        "info"
+      ],
+      "poetry": [],
+      "pipenv": []
+    }
+  }
+}
diff --git a/.github/skills/agent-native/references/agent-hooks/pre-tool-hooks/copilot-hooks.json b/.github/skills/agent-native/references/agent-hooks/pre-tool-hooks/copilot-hooks.json
new file mode 100644
index 0000000..8b47e4f
--- /dev/null
+++ b/.github/skills/agent-native/references/agent-hooks/pre-tool-hooks/copilot-hooks.json
@@ -0,0 +1,19 @@
+{
+  "version": 1,
+  "hooks": {
+    "preToolUse": [
+      {
+        "type": "command",
+        "bash": "./protect_linter_configs.sh",
+        "cwd": ".github/hooks",
+        "timeoutSec": 10
+      },
+      {
+        "type": "command",
+        "bash": "./enforce_package_managers.sh",
+        "cwd": ".github/hooks",
+        "timeoutSec": 15
+      }
+    ]
+  }
+}
diff --git a/.github/skills/agent-native/references/agent-hooks/pre-tool-hooks/enforce_package_managers.sh b/.github/skills/agent-native/references/agent-hooks/pre-tool-hooks/enforce_package_managers.sh
new file mode 100755
index 0000000..ac69a55
--- /dev/null
+++ b/.github/skills/agent-native/references/agent-hooks/pre-tool-hooks/enforce_package_managers.sh
@@ -0,0 +1,500 @@
+#!/bin/bash
+# enforce_package_managers.sh - Claude Code PreToolUse hook (Bash matcher)
+# Blocks legacy package managers and suggests project-preferred alternatives.
+#   python:     pip/pip3/python -m pip/python -m venv/poetry/pipenv  → uv
+#   javascript: npm/npx/yarn/pnpm                                    → bun
+#
+# Output: JSON per PreToolUse spec (always exit 0)
+#   {"decision": "approve"}
+#   {"decision": "block", "reason": "[hook:block] <tool> is not allowed. Use: <replacement>"}
+
+set -euo pipefail
+
+script_source="${BASH_SOURCE[0]}"
+script_dir="${script_source%/*}"
+[[ "${script_dir}" == "${script_source}" ]] && script_dir="."
+script_dir="$(cd "${script_dir}" && pwd)"
+# shellcheck source=.claude/hooks/platform_shim.sh
+source "${script_dir}/platform_shim.sh"
+
+input=$(cat)
+platform=$(detect_platform "${input}")
+tool_name=$(get_tool_name "${input}" "${platform}")
+
+# Session-level bypass (HOOK_SKIP_PM=1 claude ...)
+if [[ "${HOOK_SKIP_PM:-0}" == "1" ]]; then
+  emit_approve "${platform}"; exit 0
+fi
+
+# Copilot has no matcher support; filter non-bash tools here.
+if [[ "${tool_name}" != "Bash" ]]; then
+  emit_approve "${platform}"; exit 0
+fi
+
+# Extract command string; fail-open if jaq missing or input malformed
+# shellcheck disable=SC2310  # get_command is an intentional predicate here
+cmd=$(get_command "${input}" "${platform}") || {
+  emit_approve "${platform}"; exit 0
+}
+[[ -z "${cmd}" ]] && { emit_approve "${platform}"; exit 0; }
+
+config_file="${CLAUDE_PROJECT_DIR:-.}/.claude/hooks/config.json"
+
+# get_pm_enforcement(lang) — reads .package_managers.<lang> from config
+# Returns "uv", "uv:warn", "bun", "bun:warn", or "false"
+get_pm_enforcement() {
+  local lang="$1"
+  jaq -r ".package_managers.${lang} // false" \
+    "${config_file}" 2>/dev/null || echo "false"
+}
+
+# parse_pm_config(value) — splits value into mode+tool
+# false     → "off"
+# *:warn    → "warn:<tool>"
+# *         → "block:<tool>"
+parse_pm_config() {
+  local value="$1"
+  case "${value}" in
+    false) echo "off" ;;
+    *:warn) echo "warn:${value%:warn}" ;;
+    *) echo "block:${value}" ;;
+  esac
+}
+
+# is_allowed_subcommand(tool, subcmd) — checks allowlist in config
+# Returns 0 if subcmd is in the allowed list, 1 otherwise
+is_allowed_subcommand() {
+  local tool="$1"
+  local subcmd="$2"
+  local allowed
+  while IFS= read -r allowed; do
+    [[ "${subcmd}" == "${allowed}" ]] && return 0
+  done < <(jaq -r ".package_managers.allowed_subcommands.${tool} // [] | .[]" \
+    "${config_file}" 2>/dev/null || true)
+  return 1
+}
+
+# compute_replacement_message(tool, subcmd) — maps tool:subcmd to replacement
+compute_replacement_message() {
+  local tool="$1"
+  local subcmd="${2:-}"
+
+  case "${tool}:${subcmd}" in
+    pip:install|pip3:install)
+      if echo "${cmd}" | grep -qE '[[:space:]]-r([[:space:]]|[^[:space:]-])'; then
+        local req_file
+        req_file=$(echo "${cmd}" | sed -nE \
+          's/.*[[:space:]]-r[[:space:]]*([^[:space:]-][^[:space:]]*).*/\1/p')
+        echo "uv pip install -r ${req_file:-requirements.txt}"
+      elif echo "${cmd}" | grep -qE ' -e '; then
+        echo "uv pip install -e ."
+      else
+        local pkgs
+        pkgs=$(echo "${cmd}" | sed -nE 's/.*pip3?[[:space:]]+install[[:space:]]+([^-].*)/\1/p' | \
+          sed 's/[[:space:]]*$//')
+        if [[ -n "${pkgs}" ]]; then
+          echo "uv add ${pkgs}"
+        else
+          echo "uv add <packages>"
+        fi
+      fi
+      ;;
+    pip:uninstall|pip3:uninstall)
+      local pkgs
+      pkgs=$(echo "${cmd}" | sed -nE 's/.*pip3?[[:space:]]+uninstall[[:space:]]+([^-].*)/\1/p' | \
+        sed 's/[[:space:]]*$//')
+      if [[ -n "${pkgs}" ]]; then
+        echo "uv remove ${pkgs}"
+      else
+        echo "uv remove <packages>"
+      fi
+      ;;
+    pip:freeze|pip3:freeze) echo "uv pip freeze" ;;
+    pip:list|pip3:list)     echo "uv pip list" ;;
+    pip:*|pip3:*)           echo "uv <equivalent>" ;;
+    "python -m pip":*)      echo "uv add <packages>" ;;
+    "python -m venv":*)
+      local venv_dir
+      venv_dir=$(echo "${cmd}" | sed -nE 's/.*python3?[[:space:]]+-m[[:space:]]+venv[[:space:]]+([^[:space:]]+).*/\1/p')
+      if [[ -n "${venv_dir}" ]]; then
+        echo "uv venv ${venv_dir}"
+      else
+        echo "uv venv"
+      fi
+      ;;
+    poetry:add)
+      local pkgs
+      pkgs=$(echo "${cmd}" | sed -nE 's/.*poetry[[:space:]]+add[[:space:]]+(.+)/\1/p' | \
+        sed 's/[[:space:]]*$//')
+      if [[ -n "${pkgs}" ]]; then
+        echo "uv add ${pkgs}"
+      else
+        echo "uv add <packages>"
+      fi
+      ;;
+    poetry:install)   echo "uv sync" ;;
+    poetry:run)
+      local run_cmd
+      run_cmd=$(echo "${cmd}" | sed -nE 's/.*poetry[[:space:]]+run[[:space:]]+(.+)/\1/p' | \
+        sed 's/[[:space:]]*$//')
+      if [[ -n "${run_cmd}" ]]; then
+        echo "uv run ${run_cmd}"
+      else
+        echo "uv run <cmd>"
+      fi
+      ;;
+    poetry:lock)      echo "uv lock" ;;
+    poetry:*)         echo "uv <equivalent>" ;;
+    pipenv:install)
+      local pkgs
+      pkgs=$(echo "${cmd}" | sed -nE 's/.*pipenv[[:space:]]+install[[:space:]]+([^-].*)/\1/p' | \
+        sed 's/[[:space:]]*$//')
+      if [[ -n "${pkgs}" ]]; then
+        echo "uv add ${pkgs}"
+      else
+        echo "uv sync"
+      fi
+      ;;
+    pipenv:run)
+      local run_cmd
+      run_cmd=$(echo "${cmd}" | sed -nE 's/.*pipenv[[:space:]]+run[[:space:]]+(.+)/\1/p' | \
+        sed 's/[[:space:]]*$//')
+      if [[ -n "${run_cmd}" ]]; then
+        echo "uv run ${run_cmd}"
+      else
+        echo "uv run <cmd>"
+      fi
+      ;;
+    pipenv:*)         echo "uv <equivalent>" ;;
+    npm:install|npm:i|npm:ci)
+      local pkgs
+      pkgs=$(echo "${cmd}" | sed -nE 's/.*npm[[:space:]]+(install|i|ci)[[:space:]]+([^-].*)/\2/p' | \
+        sed 's/[[:space:]]*$//')
+      if [[ -n "${pkgs}" ]]; then
+        echo "bun add ${pkgs}"
+      else
+        echo "bun install"
+      fi
+      ;;
+    npm:run)
+      local script
+      script=$(echo "${cmd}" | sed -nE 's/.*npm[[:space:]]+run[[:space:]]+([^[:space:]]+).*/\1/p')
+      if [[ -n "${script}" ]]; then
+        echo "bun run ${script}"
+      else
+        echo "bun run <script>"
+      fi
+      ;;
+    npm:test)         echo "bun run test" ;;  # "bun test" would invoke Bun's built-in runner, not the package.json "test" script
+    npm:start)        echo "bun run start" ;;
+    npm:exec)         echo "bunx <pkg>" ;;
+    npm:init)         echo "bun init" ;;
+    npm:uninstall|npm:remove)
+      local pkgs
+      pkgs=$(echo "${cmd}" | sed -nE 's/.*npm[[:space:]]+(uninstall|remove)[[:space:]]+([^-].*)/\2/p' | \
+        sed 's/[[:space:]]*$//')
+      if [[ -n "${pkgs}" ]]; then
+        echo "bun remove ${pkgs}"
+      else
+        echo "bun remove <packages>"
+      fi
+      ;;
+    npm:*)            echo "bun <equivalent>" ;;
+    npx:*)
+      local pkg
+      pkg=$(echo "${cmd}" | sed -E 's/.*npx[[:space:]]+//' | \
+        tr ' ' '\n' | grep -v '^-' | head -1)
+      if [[ -n "${pkg}" ]]; then
+        echo "bunx ${pkg}"
+      else
+        echo "bunx <pkg>"
+      fi
+      ;;
+    yarn:add)
+      local pkgs
+      pkgs=$(echo "${cmd}" | sed -nE 's/.*yarn[[:space:]]+add[[:space:]]+([^-].*)/\1/p' | \
+        sed 's/[[:space:]]*$//')
+      if [[ -n "${pkgs}" ]]; then
+        echo "bun add ${pkgs}"
+      else
+        echo "bun add <packages>"
+      fi
+      ;;
+    yarn:install)     echo "bun install" ;;
+    yarn:run)
+      local script
+      script=$(echo "${cmd}" | sed -nE 's/.*yarn[[:space:]]+run[[:space:]]+([^[:space:]]+).*/\1/p')
+      if [[ -n "${script}" ]]; then
+        echo "bun run ${script}"
+      else
+        echo "bun run <script>"
+      fi
+      ;;
+    yarn:remove)
+      local pkgs
+      pkgs=$(echo "${cmd}" | sed -nE 's/.*yarn[[:space:]]+remove[[:space:]]+([^-].*)/\1/p' | \
+        sed 's/[[:space:]]*$//')
+      if [[ -n "${pkgs}" ]]; then
+        echo "bun remove ${pkgs}"
+      else
+        echo "bun remove <packages>"
+      fi
+      ;;
+    yarn:*)           echo "bun <equivalent>" ;;
+    pnpm:add)
+      local pkgs
+      pkgs=$(echo "${cmd}" | sed -nE 's/.*pnpm[[:space:]]+add[[:space:]]+([^-].*)/\1/p' | \
+        sed 's/[[:space:]]*$//')
+      if [[ -n "${pkgs}" ]]; then
+        echo "bun add ${pkgs}"
+      else
+        echo "bun add <packages>"
+      fi
+      ;;
+    pnpm:install)     echo "bun install" ;;
+    pnpm:run)
+      local script
+      script=$(echo "${cmd}" | sed -nE 's/.*pnpm[[:space:]]+run[[:space:]]+([^[:space:]]+).*/\1/p')
+      if [[ -n "${script}" ]]; then
+        echo "bun run ${script}"
+      else
+        echo "bun run <script>"
+      fi
+      ;;
+    pnpm:remove)
+      local pkgs
+      pkgs=$(echo "${cmd}" | sed -nE 's/.*pnpm[[:space:]]+(remove|uninstall)[[:space:]]+([^-].*)/\2/p' | \
+        sed 's/[[:space:]]*$//')
+      if [[ -n "${pkgs}" ]]; then
+        echo "bun remove ${pkgs}"
+      else
+        echo "bun remove <packages>"
+      fi
+      ;;
+    pnpm:*)           echo "bun <equivalent>" ;;
+    *)                echo "use the project-preferred tool" ;;
+  esac
+}
+
+# check_replacement_tool(tool, install_hint) — warns once per session if tool missing
+check_replacement_tool() {
+  local tool="$1"
+  local install_hint="$2"
+  if ! command -v "${tool}" >/dev/null 2>&1; then
+    local marker="/tmp/.pm_warn_${tool}_${HOOK_GUARD_PID:-${PPID}}"
+    if [[ ! -f "${marker}" ]]; then
+      echo "[hook:warning] ${tool} not found — blocked but replacement unavailable. Install: ${install_hint}" >&2
+      touch "${marker}" 2>/dev/null || true
+    fi
+  fi
+}
+
+# approve() — log if debug/log, output approve JSON, exit 0
+approve() {
+  if [[ "${HOOK_DEBUG_PM:-0}" == "1" ]]; then
+    echo "[hook:debug] PM check: command='${cmd}', action='approve'" >&2
+  fi
+  if [[ "${HOOK_LOG_PM:-0}" == "1" ]]; then
+    local log_file="/tmp/.pm_enforcement_${HOOK_GUARD_PID:-${PPID}}.log"
+    echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) | approve | | | ${cmd:0:80}" >> "${log_file}" 2>/dev/null || true
+  fi
+  emit_approve "${platform}"
+  exit 0
+}
+
+# block(tool, subcmd) — compute replacement, output block JSON, exit 0
+block() {
+  local tool="$1"
+  local subcmd="${2:-}"
+  local replacement
+  replacement=$(compute_replacement_message "${tool}" "${subcmd}")
+  if [[ "${HOOK_DEBUG_PM:-0}" == "1" ]]; then
+    echo "[hook:debug] PM check: command='${cmd}', action='block', tool='${tool}', subcmd='${subcmd}'" >&2
+  fi
+  if [[ "${HOOK_LOG_PM:-0}" == "1" ]]; then
+    local log_file="/tmp/.pm_enforcement_${HOOK_GUARD_PID:-${PPID}}.log"
+    echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) | block | ${tool} | ${subcmd} | ${cmd:0:80}" >> "${log_file}" 2>/dev/null || true
+  fi
+  emit_block "${platform}" "[hook:block] ${tool} is not allowed. Use: ${replacement}"
+  exit 0
+}
+
+# warn(tool, subcmd) — compute replacement, output approve JSON + advisory to stderr, exit 0
+warn() {
+  local tool="$1"
+  local subcmd="${2:-}"
+  local replacement
+  replacement=$(compute_replacement_message "${tool}" "${subcmd}")
+  if [[ "${HOOK_DEBUG_PM:-0}" == "1" ]]; then
+    echo "[hook:debug] PM check: command='${cmd}', action='warn', tool='${tool}', subcmd='${subcmd}'" >&2
+  fi
+  if [[ "${HOOK_LOG_PM:-0}" == "1" ]]; then
+    local log_file="/tmp/.pm_enforcement_${HOOK_GUARD_PID:-${PPID}}.log"
+    echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) | warn | ${tool} | ${subcmd} | ${cmd:0:80}" >> "${log_file}" 2>/dev/null || true
+  fi
+  emit_approve "${platform}"
+  echo "[hook:advisory] ${tool} detected. Prefer: ${replacement}" >&2
+  exit 0
+}
+
+# enforce(mode, tool, subcmd) — dispatches to warn or block based on mode
+enforce() {
+  local mode="$1"
+  local tool="$2"
+  local subcmd="${3:-}"
+  if [[ "${mode}" == "warn" ]]; then
+    warn "${tool}" "${subcmd}"
+  else
+    block "${tool}" "${subcmd}"
+  fi
+}
+
+# ============================================================
+# Python enforcement
+# pip/python-m family: elif chain (uv pip passthrough + python -m pip download allowlist reuse)
+# poetry/pipenv: independent blocks (catches "pip diag && poetry add" compounds)
+# ============================================================
+
+WB_START='(^|[^a-zA-Z0-9_])'
+WB_END='([^a-zA-Z0-9_]|$)'
+
+py_raw=$(get_pm_enforcement "python")
+py_parsed=$(parse_pm_config "${py_raw}")
+py_mode="${py_parsed%%:*}"   # off / warn / block
+
+if [[ "${py_mode}" != "off" ]]; then
+
+  # Elif chain: uv pip passthrough + pip/python-m family
+  # (elif required: "uv pip" contains "pip"; python -m pip download allowlist reused via fallthrough)
+  if   [[ "${cmd}" =~ ${WB_START}uv[[:space:]]+pip ]]; then
+    :   # uv pip passthrough: skips the pip family only, so compounds like "uv pip list && npm install" still reach the checks below
+
+  elif [[ "${cmd}" =~ ${WB_START}pip3?[[:space:]]+([a-zA-Z]+) ]]; then
+    subcmd="${BASH_REMATCH[2]}"
+    # shellcheck disable=SC2310  # is_allowed_subcommand is an intentional predicate
+    if ! is_allowed_subcommand "pip" "${subcmd}"; then
+      check_replacement_tool "uv" "brew install uv"
+      enforce "${py_mode}" "pip" "${subcmd}"
+    fi
+  elif [[ "${cmd}" =~ ${WB_START}pip3?[[:space:]]+(--version|-[vVh]|--help)${WB_END} ]]; then
+    :   # diagnostic no-op
+  elif [[ "${cmd}" =~ ${WB_START}pip3?${WB_END} ]]; then
+    check_replacement_tool "uv" "brew install uv"
+    enforce "${py_mode}" "pip"
+  elif [[ "${cmd}" =~ ${WB_START}python3?[[:space:]]+-m[[:space:]]+pip${WB_END} ]]; then
+    check_replacement_tool "uv" "brew install uv"
+    enforce "${py_mode}" "python -m pip"
+  elif [[ "${cmd}" =~ ${WB_START}python3?[[:space:]]+-m[[:space:]]+venv${WB_END} ]]; then
+    check_replacement_tool "uv" "brew install uv"
+    enforce "${py_mode}" "python -m venv"
+  fi
+
+  # Independent: poetry (now catches "pip --version && poetry add" compounds)
+  if   [[ "${cmd}" =~ ${WB_START}poetry[[:space:]]+([a-zA-Z]+) ]]; then
+    subcmd="${BASH_REMATCH[2]}"
+    # shellcheck disable=SC2310  # is_allowed_subcommand is an intentional predicate
+    if ! is_allowed_subcommand "poetry" "${subcmd}"; then
+      check_replacement_tool "uv" "brew install uv"
+      enforce "${py_mode}" "poetry" "${subcmd}"
+    fi
+  elif [[ "${cmd}" =~ ${WB_START}poetry[[:space:]]+(--version|-[vVh]|--help)${WB_END} ]]; then
+    :   # diagnostic no-op
+  elif [[ "${cmd}" =~ ${WB_START}poetry${WB_END} ]]; then
+    check_replacement_tool "uv" "brew install uv"
+    enforce "${py_mode}" "poetry"
+  fi
+
+  # Independent: pipenv (now catches "pip --version && pipenv install" compounds)
+  if   [[ "${cmd}" =~ ${WB_START}pipenv[[:space:]]+([a-zA-Z]+) ]]; then
+    subcmd="${BASH_REMATCH[2]}"
+    # shellcheck disable=SC2310  # is_allowed_subcommand is an intentional predicate
+    if ! is_allowed_subcommand "pipenv" "${subcmd}"; then
+      check_replacement_tool "uv" "brew install uv"
+      enforce "${py_mode}" "pipenv" "${subcmd}"
+    fi
+  elif [[ "${cmd}" =~ ${WB_START}pipenv[[:space:]]+(--version|-[vVh]|--help)${WB_END} ]]; then
+    :   # diagnostic no-op
+  elif [[ "${cmd}" =~ ${WB_START}pipenv${WB_END} ]]; then
+    check_replacement_tool "uv" "brew install uv"
+    enforce "${py_mode}" "pipenv"
+  fi
+
+fi
+
+# ============================================================
+# JavaScript enforcement (independent if blocks — required for compound cmd safety)
+# ============================================================
+
+js_raw=$(get_pm_enforcement "javascript")
+js_parsed=$(parse_pm_config "${js_raw}")
+js_mode="${js_parsed%%:*}"   # off / warn / block
+
+if [[ "${js_mode}" != "off" ]]; then
+
+  # npm (independent check — elif chain within this block)
+  if   [[ "${cmd}" =~ ${WB_START}npm[[:space:]]+([a-zA-Z]+) ]]; then
+    subcmd="${BASH_REMATCH[2]}"
+    # shellcheck disable=SC2310  # is_allowed_subcommand is an intentional predicate
+    if ! is_allowed_subcommand "npm" "${subcmd}"; then
+      check_replacement_tool "bun" "curl -fsSL https://bun.sh/install | bash"
+      enforce "${js_mode}" "npm" "${subcmd}"
+    fi
+  elif [[ "${cmd}" =~ ${WB_START}npm[[:space:]]+-[^[:space:]]*[[:space:]]+([a-zA-Z]+) ]]; then
+    subcmd="${BASH_REMATCH[2]}"
+    # shellcheck disable=SC2310  # is_allowed_subcommand is an intentional predicate
+    if ! is_allowed_subcommand "npm" "${subcmd}"; then
+      check_replacement_tool "bun" "curl -fsSL https://bun.sh/install | bash"
+      enforce "${js_mode}" "npm" "${subcmd}"
+    fi
+  elif [[ "${cmd}" =~ ${WB_START}npm[[:space:]]+(--version|-[vVh]|--help)${WB_END} ]]; then
+    :   # diagnostic no-op
+  elif [[ "${cmd}" =~ ${WB_START}npm[[:space:]]+-[^[:space:]]* ]]; then
+    check_replacement_tool "bun" "curl -fsSL https://bun.sh/install | bash"
+    enforce "${js_mode}" "npm"
+  elif [[ "${cmd}" =~ ${WB_START}npm${WB_END} ]]; then
+    check_replacement_tool "bun" "curl -fsSL https://bun.sh/install | bash"
+    enforce "${js_mode}" "npm"
+  fi
+
+  # npx (independent check)
+  if   [[ "${cmd}" =~ ${WB_START}npx[[:space:]]+(--version|-[vVh]|--help)${WB_END} ]]; then
+    :   # diagnostic no-op
+  elif [[ "${cmd}" =~ ${WB_START}npx${WB_END} ]]; then
+    check_replacement_tool "bun" "curl -fsSL https://bun.sh/install | bash"
+    enforce "${js_mode}" "npx"
+  fi
+
+  # yarn (independent check — not elif from npm)
+  if   [[ "${cmd}" =~ ${WB_START}yarn[[:space:]]+([a-zA-Z]+) ]]; then
+    subcmd="${BASH_REMATCH[2]}"
+    # shellcheck disable=SC2310  # is_allowed_subcommand is an intentional predicate
+    if ! is_allowed_subcommand "yarn" "${subcmd}"; then
+      check_replacement_tool "bun" "curl -fsSL https://bun.sh/install | bash"
+      enforce "${js_mode}" "yarn" "${subcmd}"
+    fi
+  elif [[ "${cmd}" =~ ${WB_START}yarn[[:space:]]+(--version|-[vVh]|--help)${WB_END} ]]; then
+    :   # diagnostic no-op
+  elif [[ "${cmd}" =~ ${WB_START}yarn${WB_END} ]]; then
+    check_replacement_tool "bun" "curl -fsSL https://bun.sh/install | bash"
+    enforce "${js_mode}" "yarn" "install"
+  fi
+
+  # pnpm (independent check — not elif from yarn)
+  if   [[ "${cmd}" =~ ${WB_START}pnpm[[:space:]]+([a-zA-Z]+) ]]; then
+    subcmd="${BASH_REMATCH[2]}"
+    # shellcheck disable=SC2310  # is_allowed_subcommand is an intentional predicate
+    if ! is_allowed_subcommand "pnpm" "${subcmd}"; then
+      check_replacement_tool "bun" "curl -fsSL https://bun.sh/install | bash"
+      enforce "${js_mode}" "pnpm" "${subcmd}"
+    fi
+  elif [[ "${cmd}" =~ ${WB_START}pnpm[[:space:]]+(--version|-[vVh]|--help)${WB_END} ]]; then
+    :   # diagnostic no-op
+  elif [[ "${cmd}" =~ ${WB_START}pnpm${WB_END} ]]; then
+    check_replacement_tool "bun" "curl -fsSL https://bun.sh/install | bash"
+    enforce "${js_mode}" "pnpm" "install"
+  fi
+
+fi
+
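+# No enforced tool matched: approve() logs the decision (when HOOK_DEBUG_PM/HOOK_LOG_PM are set), emits approve JSON, and exits 0.
+approve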
diff --git a/.github/skills/agent-native/references/agent-hooks/pre-tool-hooks/platform_shim.sh b/.github/skills/agent-native/references/agent-hooks/pre-tool-hooks/platform_shim.sh
new file mode 100644
index 0000000..fc59571
--- /dev/null
+++ b/.github/skills/agent-native/references/agent-hooks/pre-tool-hooks/platform_shim.sh
@@ -0,0 +1,92 @@
+#!/bin/bash
+# platform_shim.sh - Normalize Claude Code and Copilot CLI hook I/O
+
+detect_platform() {
+  local input="${1:-}"
+  if [[ "${input}" == *'"toolName"'* ]]; then
+    echo "copilot"
+    return 0
+  fi
+  if jaq -e 'has("toolName")' <<<"${input}" >/dev/null 2>&1; then
+    echo "copilot"
+  else
+    echo "claude"
+  fi
+}
+
+get_tool_name() {
+  local input="${1:-}"
+  local platform="${2:-$(detect_platform "${input}")}"
+  local raw=""
+  if [[ "${platform}" == "copilot" ]]; then
+    raw=$(jaq -r '.toolName // empty' <<<"${input}" 2>/dev/null) || raw=""
+    case "${raw}" in
+      edit) echo "Edit" ;;
+      create|write) echo "Write" ;;
+      bash) echo "Bash" ;;
+      *) echo "${raw}" ;;
+    esac
+  else
+    raw=$(jaq -r '.tool_name // empty' <<<"${input}" 2>/dev/null) || raw=""
+    echo "${raw}"
+  fi
+}
+
+get_tool_input() {
+  local input="${1:-}"
+  local platform="${2:-$(detect_platform "${input}")}"
+  if [[ "${platform}" == "copilot" ]]; then
+    local args=""
+    args=$(jaq -r '.toolArgs // empty' <<<"${input}" 2>/dev/null) || args=""
+    if [[ -z "${args}" ]]; then
+      echo "{}"
+    else
+      jaq -cn --arg args "${args}" '$args | fromjson? // {}' 2>/dev/null || echo "{}"
+    fi
+  else
+    jaq -c '.tool_input // {}' <<<"${input}" 2>/dev/null || echo "{}"
+  fi
+}
+
+get_file_path() {
+  local input="${1:-}"
+  local platform="${2:-$(detect_platform "${input}")}"
+  local tool_input
+  tool_input=$(get_tool_input "${input}" "${platform}")
+  jaq -r '.file_path // .filePath // .path // empty' <<<"${tool_input}" 2>/dev/null || echo ""
+}
+
+get_command() {
+  local input="${1:-}"
+  local platform="${2:-$(detect_platform "${input}")}"
+  local tool_input
+  tool_input=$(get_tool_input "${input}" "${platform}")
+  jaq -r '.command // empty' <<<"${tool_input}" 2>/dev/null || echo ""
+}
+
+emit_approve() {
+  local platform="${1:-claude}"
+  if [[ "${platform}" == "copilot" ]]; then
+    echo '{"permissionDecision":"allow"}'
+  else
+    echo '{"decision":"approve"}'
+  fi
+}
+
+emit_block() {
+  local platform="${1:-claude}"
+  local reason="${2:-Blocked by hook policy}"
+  local system_message="${3:-}"
+  if [[ "${platform}" == "copilot" ]]; then
+    jaq -n --arg reason "${reason}" \
+      '{"permissionDecision":"deny","permissionDecisionReason":$reason}'
+  else
+    if [[ -n "${system_message}" ]]; then
+      jaq -n --arg reason "${reason}" --arg msg "${system_message}" \
+        '{"decision":"block","reason":$reason,"systemMessage":$msg}'
+    else
+      jaq -n --arg reason "${reason}" \
+        '{"decision":"block","reason":$reason}'
+    fi
+  fi
+}
diff --git a/.github/skills/agent-native/references/agent-hooks/pre-tool-hooks/protect_linter_configs.sh b/.github/skills/agent-native/references/agent-hooks/pre-tool-hooks/protect_linter_configs.sh
new file mode 100755
index 0000000..30fec15
--- /dev/null
+++ b/.github/skills/agent-native/references/agent-hooks/pre-tool-hooks/protect_linter_configs.sh
@@ -0,0 +1,91 @@
+#!/bin/bash
+# protect_linter_configs.sh - Claude Code PreToolUse hook
+# shellcheck disable=SC2310  # functions in if/|| is intentional
+# Blocks modification of linter configuration files (defense layer 4)
+#
+# Protected files define code quality standards. Modifying them to make
+# violations disappear (instead of fixing the code) is rule-gaming behavior.
+#
+# Output: JSON schema per PreToolUse spec
+#   {"decision": "approve"} - Allow operation
+#   {"decision": "block", "reason": "..."} - Block operation
+
+set -euo pipefail
+
+script_dir="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+# shellcheck source=.claude/hooks/platform_shim.sh
+source "${script_dir}/platform_shim.sh"
+
+# Read JSON input from stdin
+input=$(cat)
+platform=$(detect_platform "${input}")
+tool_name=$(get_tool_name "${input}" "${platform}")
+
+# Copilot has no matcher support; filter non-edit tools here.
+if [[ "${tool_name}" != "Edit" ]] && [[ "${tool_name}" != "Write" ]]; then
+  emit_approve "${platform}"
+  exit 0
+fi
+
+# Extract file path from normalized tool input
+file_path=$(get_file_path "${input}" "${platform}")
+
+# Skip if no file path (approve with valid JSON)
+if [[ -z "${file_path}" ]]; then
+  emit_approve "${platform}"
+  exit 0
+fi
+
+# Get basename for matching
+basename=$(basename "${file_path}")
+
+# Path-based protection for .claude/ directory (entire hooks directory and settings files).
+# A "/" is prepended so relative paths such as ".claude/hooks/x.sh" match as well.
+if [[ "/${file_path}" == *"/.claude/hooks/"* ]] \
+  || [[ "/${file_path}" == *"/.claude/settings.json" ]] \
+  || [[ "/${file_path}" == *"/.claude/settings.local.json" ]]; then
+  emit_block "${platform}" "Protected Claude Code config (${basename}). Hook scripts and settings are immutable."
+  exit 0
+fi
+
+# Load protected files from config, or use defaults
+load_protected_files() {
+  local config_file="${CLAUDE_PROJECT_DIR:-.}/.claude/hooks/config.json"
+  if [[ -f "${config_file}" ]] && command -v jaq >/dev/null 2>&1; then
+    local files
+    files=$(jaq -r '.protected_files // [] | .[]' "${config_file}" 2>/dev/null)
+    if [[ -n "${files}" ]]; then
+      echo "${files}"
+      return
+    fi
+  fi
+  # Default protected files
+  printf '%s\n' \
+    ".markdownlint.jsonc" ".markdownlint-cli2.jsonc" ".shellcheckrc" \
+    ".yamllint" ".hadolint.yaml" ".jscpd.json" ".flake8" \
+    "taplo.toml" ".ruff.toml" "ty.toml" \
+    "biome.json" ".oxlintrc.json" ".semgrep.yml" "knip.json"
+}
+
+# Check if basename matches a protected linter config file
+is_protected_config() {
+  local check_basename="$1"
+  local protected_file
+  while IFS= read -r protected_file; do
+    [[ -z "${protected_file}" ]] && continue
+    if [[ "${check_basename}" == "${protected_file}" ]]; then
+      return 0
+    fi
+  done < <(load_protected_files || true)
+  return 1
+}
+
+# Check if this is a protected linter config file
+if is_protected_config "${basename}"; then
+  emit_block "${platform}" "Protected linter config file (${basename}). Fix the code, not the rules."
+  exit 0
+fi
+
+# Not a protected file, allow operation
+emit_approve "${platform}"
+exit 0
diff --git a/.github/skills/agent-native/references/browser-tools/README.md b/.github/skills/agent-native/references/browser-tools/README.md
new file mode 100644
index 0000000..5745838
--- /dev/null
+++ b/.github/skills/agent-native/references/browser-tools/README.md
@@ -0,0 +1,122 @@
+# Browser Tools for Agent Harnesses
+
+Coding agents are blind to what their code does in the browser. They can read source files but can't see runtime behavior — layout breaks, console errors, network failures, slow renders. Browser tools close this gap.
+
+## Why This Matters for Harness Engineering
+
+The verify loop isn't complete without runtime validation. An agent can pass lint and tests but still ship a broken UI. Browser tools let agents:
+
+- **See what users see** — take snapshots of the rendered page
+- **Debug runtime errors** — read console logs, inspect network requests
+- **Verify fixes** — navigate, interact, confirm the bug is gone
+- **Profile performance** — measure Core Web Vitals, find bottlenecks
+
+## Two Approaches
+
+| Tool | Type | Best For | Setup |
+|---|---|---|---|
+| **Playwright CLI** | CLI (skill) | Coding agents (Copilot CLI, Claude Code) | `npm i -g @playwright/cli` |
+| **Chrome DevTools MCP** | MCP server | Any MCP-compatible agent | MCP config entry |
+
+They're complementary: Playwright CLI is better for agent-driven automation (snapshot → interact → verify), while Chrome DevTools MCP is better for deep debugging (performance traces, network analysis, DOM inspection).
+
+## Recommendation
+
+- **Start with Playwright CLI** — it's simpler, works as an agent skill, and covers the 80% case (navigate, interact, snapshot, verify)
+- **Add Chrome DevTools MCP** when you need performance profiling, network analysis, or deep DOM inspection
+- **For frontend-heavy projects**, use both
+
+## Setup
+
+### Playwright CLI
+
+Install globally:
+```bash
+npm install -g @playwright/cli
+```
+
+The skill is auto-discovered by Copilot and Claude Code when installed. For explicit use:
+```bash
+# Open a page, take a snapshot, interact
+playwright-cli open http://localhost:3000
+playwright-cli snapshot
+playwright-cli click e5
+playwright-cli fill e7 "test@example.com"
+playwright-cli screenshot --filename=after-fix.png
+playwright-cli close
+```
+
+See `playwright-cli/` for the full skill reference.
+
+### Chrome DevTools MCP
+
+Add to your MCP config (`.github/copilot-mcp.json`, `claude_desktop_config.json`, etc.):
+```json
+{
+  "mcpServers": {
+    "chrome-devtools": {
+      "command": "npx",
+      "args": ["chrome-devtools-mcp@latest"]
+    }
+  }
+}
+```
+
+See `chrome-devtools-mcp/` for the full skill reference.
+
+## Harness Integration
+
+### Verify Loop With Browser Checks
+
+```bash
+# In scripts/harness/smoke.sh or verify script:
+
+# 1. Static checks
+make check
+
+# 2. Run tests
+make test
+
+# 3. Browser verification (if app is running)
+if curl -sf http://localhost:3000 > /dev/null 2>&1; then  # -f: treat HTTP 4xx/5xx responses as "not running"
+  playwright-cli open http://localhost:3000
+  playwright-cli snapshot --filename=.harness/page-snapshot.yml
+  playwright-cli console  # check for JS errors
+  playwright-cli close
+fi
+```
+
+### PostToolUse Hook for Auto-Verify
+
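+After file edits in a frontend project, auto-snapshot the running app. The config below uses the Copilot CLI hook format, matching the `toolName`/`toolArgs` payload the script reads:
+
+```json
+{
+  "version": 1,
+  "hooks": {
+    "postToolUse": [
+      {
+        "type": "command",
+        "bash": "./scripts/hooks/browser-verify.sh",
+        "timeoutSec": 30
+      }
+    ]
+  }
+}
+```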
+
+```bash
+#!/bin/bash
+INPUT=$(cat)
+TOOL_NAME=$(echo "$INPUT" | jq -r '.toolName')
+
+# Only verify after file edits
+if [[ "$TOOL_NAME" == "edit" || "$TOOL_NAME" == "create" ]]; then
+  FILE=$(echo "$INPUT" | jq -r '.toolArgs' | jq -r '.path // .files[0]')
+  # Only for frontend files
+  if [[ "$FILE" =~ \.(tsx?|jsx?|css|html|vue|svelte)$ ]]; then
+    # Wait for hot reload
+    sleep 2
+    playwright-cli snapshot --filename=.harness/post-edit-snapshot.yml 2>/dev/null || true
+  fi
+fi
+```
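+
+The extension filter is the part of this hook most likely to need tuning per project. A minimal sketch for exercising it in isolation, assuming a hypothetical helper name `is_frontend_file` (not part of the hook above) and illustrative sample paths:
+
+```bash
+#!/bin/bash
+# Hypothetical helper: the same extension filter as the hook above,
+# factored into a function so it can be checked without a running browser.
+is_frontend_file() {
+  local file="${1:-}"
+  [[ "${file}" =~ \.(tsx?|jsx?|css|html|vue|svelte)$ ]]
+}
+
+# Sample paths are illustrative only.
+for f in src/App.tsx styles/main.css server/main.py README.md; do
+  if is_frontend_file "${f}"; then
+    echo "${f}: snapshot"
+  else
+    echo "${f}: skip"
+  fi
+done
+```
+
+Keeping the filter in one function makes it easier to add extensions (for example `.scss`) without touching the snapshot logic.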
diff --git a/.github/skills/agent-native/references/browser-tools/chrome-devtools-mcp/README.md b/.github/skills/agent-native/references/browser-tools/chrome-devtools-mcp/README.md
new file mode 100644
index 0000000..6f09527
--- /dev/null
+++ b/.github/skills/agent-native/references/browser-tools/chrome-devtools-mcp/README.md
@@ -0,0 +1,146 @@
+# Chrome DevTools MCP — Agent Skill Reference
+
+**Source:** [ChromeDevTools/chrome-devtools-mcp](https://github.com/ChromeDevTools/chrome-devtools-mcp/)
+**Type:** MCP server (Model Context Protocol)
+**Compatibility:** Any MCP-compatible agent (Copilot, Claude Code, Gemini CLI, Cursor, etc.)
+**Official blog:** [Chrome for Developers](https://developer.chrome.com/blog/chrome-devtools-mcp)
+
+---
+
+## Overview
+
+Chrome DevTools MCP is an MCP server that gives AI agents access to the Chrome DevTools Protocol for deep browser debugging: performance profiling, network analysis, DOM inspection, console monitoring, and device emulation. It complements Playwright CLI: use Playwright for interaction and DevTools MCP for deep inspection.
+
+## Setup
+
+Add to your MCP config:
+
+**Copilot (VS Code)** — `.github/copilot-mcp.json` or VS Code settings:
+```json
+{
+  "mcpServers": {
+    "chrome-devtools": {
+      "command": "npx",
+      "args": ["chrome-devtools-mcp@latest"]
+    }
+  }
+}
+```
+
+**Claude Code** — `.mcp.json` in the project root (Claude Desktop reads `claude_desktop_config.json` instead):
+```json
+{
+  "mcpServers": {
+    "chrome-devtools": {
+      "command": "npx",
+      "args": ["chrome-devtools-mcp@latest"]
+    }
+  }
+}
+```
+
+**Copilot CLI** — `~/.copilot/mcp-config.json` or per-project config.
+
+## Tool Categories
+
+### Navigation & Page Management
+- `new_page` — open a new tab
+- `navigate_page` — go to URL, reload, back/forward
+- `select_page` — switch between open tabs
+- `list_pages` — see all open tabs and their IDs
+- `close_page` — close a tab
+- `wait_for` — wait for text to appear
+
+### Input & Interaction
+- `click` — click element (use `uid` from snapshot)
+- `fill` / `fill_form` — type into inputs or fill multiple fields
+- `hover` — mouse over element
+- `press_key` — keyboard shortcuts (`Enter`, `Control+C`)
+- `drag` — drag and drop
+- `handle_dialog` — accept/dismiss alerts
+- `upload_file` — file input upload
+
+### Debugging & Inspection
+- `take_snapshot` — accessibility tree (best for identifying elements)
+- `take_screenshot` — visual capture
+- `list_console_messages` / `get_console_message` — console output
+- `evaluate_script` — run JS in page context
+- `list_network_requests` / `get_network_request` — network traffic
+
+### Performance & Emulation
+- `resize_page` — viewport dimensions
+- `emulate` — CPU/network throttling, geolocation
+- `performance_start_trace` — start performance recording
+- `performance_stop_trace` — stop and save trace
+- `performance_analyze_insight` — automated analysis of Core Web Vitals
+
+## When to Use (vs Playwright CLI)
+
+| Scenario | Playwright CLI | Chrome DevTools MCP |
+|---|---|---|
+| Navigate + interact | ✅ Better (CLI-driven, fast) | ✅ Works |
+| Take snapshots | ✅ Primary method | ✅ Works |
+| Console errors | ✅ `console` command | ✅ More detailed |
+| Network analysis | Basic (`network`) | ✅ **Full request/response details** |
+| Performance profiling | Basic (`tracing-start`) | ✅ **Core Web Vitals, LCP analysis** |
+| Device emulation | Basic (`resize`) | ✅ **Full emulation (CPU, network, geo)** |
+| Request mocking | ✅ `route` command | ❌ Not available |
+| Cookie/storage mgmt | ✅ Full support | ❌ Not available |
+
+**Rule of thumb:** Start with Playwright CLI for interaction. Add DevTools MCP when you need the "why" behind performance or network issues.
+
+## Workflow Patterns
+
+### Pattern A: Identify Elements (Snapshot-First)
+```
+1. take_snapshot → get accessibility tree with uid values
+2. Find target element's uid
+3. click(uid=...) or fill(uid=..., value=...)
+```
+
+### Pattern B: Troubleshoot Errors
+```
+1. list_console_messages → check for JS errors
+2. list_network_requests → identify 4xx/5xx failures
+3. evaluate_script → inspect DOM state
+```
+
+### Pattern C: Performance Audit
+```
+1. performance_start_trace(reload=true, autoStop=true)
+2. Wait for page load
+3. performance_analyze_insight → get LCP, layout shift, bottleneck analysis
+```
+
+## Harness Engineering Integration
+
+### Add to Project MCP Config
+
+Create `.github/copilot-mcp.json` (committed, shared with team):
+```json
+{
+  "mcpServers": {
+    "chrome-devtools": {
+      "command": "npx",
+      "args": ["chrome-devtools-mcp@latest"]
+    }
+  }
+}
+```
+
+### Performance Gate in Verify Loop
+
+```markdown
+After fixing a performance issue:
+1. Start the app (`make dev`)
+2. Use chrome-devtools MCP: `performance_start_trace` on the target page
+3. Check that LCP < 2.5s, CLS < 0.1
+4. If failing, investigate with `performance_analyze_insight`
+```
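+
+The thresholds in step 3 can be turned into a mechanical check. This is a minimal sketch, assuming the LCP and CLS values have already been read out of the trace result; `gate_failures` and the constant names are hypothetical and not part of chrome-devtools-mcp:
+
+```python
+# Budgets taken from the checklist: LCP < 2.5s, CLS < 0.1.
+LCP_BUDGET_MS = 2500.0
+CLS_BUDGET = 0.1
+
+
+def gate_failures(lcp_ms: float, cls: float) -> list[str]:
+    """Return one message per budget that is not met (empty list = gate passes)."""
+    failures = []
+    if not lcp_ms < LCP_BUDGET_MS:
+        failures.append(f"LCP {lcp_ms:.0f} ms is not under {LCP_BUDGET_MS:.0f} ms")
+    if not cls < CLS_BUDGET:
+        failures.append(f"CLS {cls:.3f} is not under {CLS_BUDGET}")
+    return failures
+```
+
+Both comparisons are strict, so a value exactly at the budget counts as a failure, matching the `<` in the checklist.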
+
+## Best Practices
+
+- **Use snapshots over screenshots** — snapshots give `uid` values needed for interaction, and use fewer tokens
+- **Re-snapshot after DOM changes** — `uid` values may shift
+- **Use `list_pages` + `select_page`** when working with multiple tabs
+- **Set reasonable timeouts** for `wait_for` — don't hang on slow elements
diff --git a/.github/skills/agent-native/references/browser-tools/playwright-cli/README.md b/.github/skills/agent-native/references/browser-tools/playwright-cli/README.md
new file mode 100644
index 0000000..79a02f6
--- /dev/null
+++ b/.github/skills/agent-native/references/browser-tools/playwright-cli/README.md
@@ -0,0 +1,157 @@
+# Playwright CLI — Agent Skill Reference
+
+**Source:** [microsoft/playwright-cli](https://github.com/microsoft/playwright-cli)
+**Install:** `npm install -g @playwright/cli`
+**Compatibility:** GitHub Copilot (VS Code, CLI, coding agent), Claude Code, any skills-compatible agent
+
+---
+
+## Overview
+
+CLI for browser automation — navigate, interact, snapshot, screenshot. Designed specifically for coding agents. Uses accessibility snapshots (not screenshots) as the primary way to "see" pages, with element refs for precise interaction.
+
+## Install
+
+```bash
+npm install -g @playwright/cli
+# or per-project
+npx playwright-cli open https://example.com
+```
+
+Skills are auto-discovered by compatible agents when installed globally.
+
+## Quick Start
+
+```bash
+playwright-cli open http://localhost:3000   # launch browser
+playwright-cli snapshot                      # get accessibility tree with element refs
+playwright-cli click e15                     # click element by ref
+playwright-cli fill e7 "user@example.com"    # type into input
+playwright-cli screenshot                    # capture visual
+playwright-cli close                         # done
+```
+
+## Core Commands
+
+### Navigation
+```bash
+playwright-cli open https://example.com
+playwright-cli goto https://example.com/page
+playwright-cli go-back
+playwright-cli go-forward
+playwright-cli reload
+```
+
+### Interaction (use refs from snapshot)
+```bash
+playwright-cli click e3
+playwright-cli dblclick e7
+playwright-cli fill e5 "value"
+playwright-cli type "search query"       # types into focused element
+playwright-cli press Enter
+playwright-cli hover e4
+playwright-cli select e9 "option-value"
+playwright-cli check e12
+playwright-cli uncheck e12
+playwright-cli drag e2 e8
+playwright-cli upload ./file.pdf
+```
+
+### Inspection
+```bash
+playwright-cli snapshot                          # accessibility tree (primary)
+playwright-cli snapshot --filename=state.yml     # save to file
+playwright-cli screenshot                        # visual capture
+playwright-cli screenshot --filename=page.png
+playwright-cli eval "document.title"             # run JS
+playwright-cli eval "el => el.textContent" e5    # run JS on element
+playwright-cli console                           # JS console messages
+playwright-cli console warning                   # filter by level
+playwright-cli network                           # network requests
+```
+
+### Tabs
+```bash
+playwright-cli tab-list
+playwright-cli tab-new https://example.com
+playwright-cli tab-select 0
+playwright-cli tab-close
+```
+
+### DevTools / Tracing
+```bash
+playwright-cli tracing-start
+playwright-cli tracing-stop          # saves trace file
+playwright-cli video-start
+playwright-cli video-stop video.webm
+```
+
+### Network Mocking
+```bash
+playwright-cli route "**/*.jpg" --status=404
+playwright-cli route "https://api.example.com/**" --body='{"mock": true}'
+playwright-cli route-list
+playwright-cli unroute "**/*.jpg"
+```
+
+### State Management
+```bash
+playwright-cli state-save auth.json      # save cookies + localStorage
+playwright-cli state-load auth.json      # restore state
+playwright-cli cookie-list
+playwright-cli cookie-set session abc123
+playwright-cli localstorage-get theme
+playwright-cli localstorage-set theme dark
+```
+
+### Sessions
+```bash
+playwright-cli -s=mysession open --persistent   # named session with persistent profile
+playwright-cli -s=mysession click e6
+playwright-cli -s=mysession close
+playwright-cli list                              # list all sessions
+playwright-cli close-all
+```
+
+## Key Pattern: Snapshot-First
+
+**Always use `snapshot` before interacting.** The snapshot provides element `refs` (e.g., `e15`) that interaction commands need.
+
+```bash
+playwright-cli open http://localhost:3000
+playwright-cli snapshot           # → find e7 is the email input
+playwright-cli fill e7 "test@example.com"
+playwright-cli click e12          # submit button
+playwright-cli snapshot           # verify result
+```
+
+**Take a new snapshot after navigation or major DOM changes** — refs may change.
+
+## Harness Engineering Patterns
+
+### Pattern: Verify After Fix
+```bash
+# Agent makes a code change, then:
+playwright-cli snapshot --filename=.harness/after-fix.yml
+playwright-cli console              # check for new JS errors
+playwright-cli screenshot --filename=.harness/after-fix.png
+```
+
+### Pattern: Form Submission Test
+```bash
+playwright-cli open http://localhost:3000/form
+playwright-cli snapshot
+playwright-cli fill e1 "user@example.com"
+playwright-cli fill e2 "password123"
+playwright-cli click e3
+playwright-cli snapshot   # verify success state
+playwright-cli close
+```
+
+### Pattern: Debugging
+```bash
+playwright-cli open http://localhost:3000
+playwright-cli console              # check JS errors
+playwright-cli network              # check failed requests
+playwright-cli eval "document.querySelectorAll('.error').length"
+```
diff --git a/.github/skills/agent-native/references/openai-harness-practices.md b/.github/skills/agent-native/references/openai-harness-practices.md
new file mode 100644
index 0000000..1133f80
--- /dev/null
+++ b/.github/skills/agent-native/references/openai-harness-practices.md
@@ -0,0 +1,36 @@
+# OpenAI Harness Practices Mapping
+
+This file maps each practice from OpenAI's Harness Engineering guidance to concrete repo artifacts.
+
+## Sources
+
+- Harness Engineering: https://openai.com/index/harness-engineering/
+- Using `PLANS.md` for multi-hour tasks: https://cookbook.openai.com/articles/plan-driven-workflow
+- Data-shape boundary design reference: https://matklad.github.io/2023/08/17/types-are-parse-don-t-validate.html
+- Boundary-first architecture reference: https://matklad.github.io/2021/02/06/ARCHITECTURE.md.html
+
+## Practice Matrix
+
+| Practice | What To Implement | Required Artifacts | Verification |
+|---|---|---|---|
+| 1. Make the hard thing easy to do | Single-command wrappers for high-value tasks. | `Makefile.harness`, `scripts/harness/*.sh` | `make smoke`, `make check`, and `make ci` run without manual prep. |
+| 2. Communicate actionable constraints with compact docs | Command-first guardrails and operational constraints. | `AGENTS.md` | Agent can execute common tasks without guessing undocumented behavior. |
+| 3. Structure codebase with strict boundaries and flow | Clear boundaries, typed contracts, boundary parsing. | `docs/ARCHITECTURE.md` | Data transformations happen at edges; internals are simpler and traceable. |
+| 4. Build observability in from day 1 | Structured events/logs and correlation IDs. | `docs/OBSERVABILITY.md` | Every critical transition has traceable identifiers and stable fields. |
+| 5. Optimize for agent flow, not human flow | Durable, resumable planning context. | `PLANS.md` | Long tasks remain reproducible after interruptions or handoffs. |
+| 6. Bring your own harness | Repo-local deterministic workflows (no hidden UI/manual steps). | `Makefile.harness`, `scripts/harness/` | Same commands work in local and CI environments. |
+| 7. Prototype in natural language first | Prose-first logic drafts before code. | `PLANS.md` sections for behavior/testing intent | First implementation pass has fewer reworks and edge-case misses. |
+| 8. Invest in static analysis and linting | Fast-fail checks before expensive runs. | `Makefile.harness`, CI workflow | Lint/typecheck break builds early; test time is spent on validated code. |
+| 9. Manage entropy | Scheduled audits and drift control. | `scripts/audit_harness.sh`, CI integration | Harness docs/scripts stay aligned with repo behavior over time. |
+
+## Non-Negotiables
+
+1. Keep command entrypoints stable (`make smoke`, `make check`, `make ci`).
+2. Keep docs compact and executable, not narrative-heavy.
+3. Keep scripts deterministic and machine-readable.
+4. Keep architecture boundaries explicit and reviewed during refactors.
+5. Keep observability fields stable to support aggregation and replay.
+
+## Wizard Entry Point
+
+Use `scripts/harness_wizard.py` as the stable orchestration layer for bootstrap, primitive upgrades, and auditing.
diff --git a/.github/skills/agent-native/references/rollout-checklist.md b/.github/skills/agent-native/references/rollout-checklist.md
new file mode 100644
index 0000000..9c399f5
--- /dev/null
+++ b/.github/skills/agent-native/references/rollout-checklist.md
@@ -0,0 +1,39 @@
+# Harness Rollout Checklist
+
+Use this staged checklist when integrating the harness into an existing repository with active development.
+
+## Phase 0: Baseline
+
+- [ ] Record current build/test/lint/typecheck entrypoints.
+- [ ] Identify flaky checks and long-running hot spots.
+- [ ] Confirm required environments (local, CI, containers, services).
+- [ ] Create a starter entry in `PLANS.md` with scope and constraints.
+
+## Phase 1: Bootstrap
+
+- [ ] Run `python3 scripts/harness_wizard.py init <repo-path> --profile control`.
+- [ ] Verify generated files are present.
+- [ ] Customize template placeholders for project-specific commands.
+- [ ] Confirm `Makefile` includes `Makefile.harness`.
+
+## Phase 2: Practice Alignment
+
+- [ ] Validate all nine practices against real workflows.
+- [ ] Tighten `AGENTS.md` so high-probability tasks are one command each.
+- [ ] Update `docs/ARCHITECTURE.md` with concrete module boundaries.
+- [ ] Add observability identifiers in logs/events and document them.
+- [ ] Make static analysis and type checks mandatory before full test runs.
+
+## Phase 3: Automation + Entropy Control
+
+- [ ] Enable `.github/workflows/harness.yml` (or equivalent CI job).
+- [ ] Run `python3 scripts/harness_wizard.py audit <repo-path>` in CI.
+- [ ] Add periodic review cadence for docs/scripts drift.
+- [ ] Remove stale scripts and outdated docs to keep context clean.
+
+## Exit Criteria
+
+- [ ] New contributors can run harness commands without extra tribal knowledge.
+- [ ] Agent runs are reproducible from clean checkout.
+- [ ] Core workflows are observable and debuggable.
+- [ ] Harness audit passes consistently.
diff --git a/.github/skills/agent-native/references/static-analysis.md b/.github/skills/agent-native/references/static-analysis.md
new file mode 100644
index 0000000..fb30c6f
--- /dev/null
+++ b/.github/skills/agent-native/references/static-analysis.md
@@ -0,0 +1,104 @@
+# Static Analysis & Linting
+
+## Purpose
+
+Fast feedback before expensive test runs. Linting catches typos, unused imports, type errors, and style violations in seconds — before the agent spends minutes running tests against broken code.
+
+For harness engineering, linting is the **first gate**: `make check` (lint + typecheck) runs before `make test`. If lint fails, don't bother testing.
+
+## Principles
+
+1. **Fast** — lint should finish in under 10 seconds for most projects
+2. **Deterministic** — same code, same result, every time. Pin versions.
+3. **Fixable** — prefer linters with `--fix` / autofix so agents can self-repair
+4. **Parseable** — use JSON/SARIF output when available so agents can read results programmatically
+5. **Opinionated defaults** — start strict, relax only with justification
+
+## Recommended Tools by Language
+
+### JavaScript / TypeScript
+- **ESLint** — the standard. Use flat config (`eslint.config.js`).
+  - Recommended: `@eslint/js` recommended + `typescript-eslint` strict
+  - `eslint --format json` for agent-parseable output
+  - `eslint --fix` for auto-repair
+- **Biome** — faster alternative (Rust-based). Lint + format in one tool.
+  - `biome check --reporter json` for parseable output
+  - Good for new projects; ESLint has a broader plugin ecosystem
+
+### Python
+- **Ruff** — extremely fast (Rust-based), replaces flake8 + isort + pyupgrade + dozens more
+  - `ruff check --output-format json` for agent-parseable output
+  - `ruff check --fix` for auto-repair
+  - Config in `pyproject.toml` under `[tool.ruff]`
+- **mypy** or **pyright** for type checking (separate from lint, runs via `make typecheck`)
+
+### Go
+- **golangci-lint** — meta-linter wrapping 50+ Go linters
+  - `golangci-lint run --out-format json` for parseable output
+  - Config in `.golangci.yml`
+
+### Rust
+- **clippy** — built-in, no setup needed
+  - `cargo clippy --all-targets -- -D warnings`
+  - JSON output: `cargo clippy --message-format json`
+
+### C# / .NET
+- **dotnet format** — built-in formatter + analyzer
+  - `dotnet format --verify-no-changes` for CI
+  - Roslyn analyzers for deeper checks (configure in `.editorconfig`)
+
+### Multi-Language / Monorepo
+- **MegaLinter** — runs 50+ linters across languages in Docker
+  - Best for: monorepos with many services in different languages
+  - `mega-linter-runner` for local use, GitHub Action for CI
+  - Config in `.mega-linter.yml` — enable only what you need
+  - Heavy (~minutes), so use per-language linters for tight agent loops and MegaLinter for the CI gate
+- **Trunk Check** (`trunk.io`) — single CLI, auto-detects language, runs appropriate linters
+  - Lighter than MegaLinter, good local experience
+  - Proprietary dependency (free tier available)
+
+## The lint.sh Script
+
+The harness template ships `scripts/harness/lint.sh` with auto-detection:
+
+1. If `HARNESS_LINT_CMD` env var is set → run that (explicit override)
+2. Detect language from project files (package.json, pyproject.toml, Cargo.toml, go.mod)
+3. Run the appropriate linter with sane defaults
+4. Exit 0 on clean, non-zero on violations
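+
+The detection order above can be sketched in Python for illustration. The shipped `lint.sh` is a shell script whose body is not reproduced here, so the marker-to-command table and the `pick_lint_command` name below are assumptions, not the script's actual defaults:
+
+```python
+import os
+from pathlib import Path
+from typing import Mapping, Optional
+
+# Marker file -> linter command. Illustrative assumptions only; the real
+# lint.sh may run different commands for each language.
+DEFAULT_LINTERS = (
+    ("package.json", "npx eslint ."),
+    ("pyproject.toml", "ruff check ."),
+    ("Cargo.toml", "cargo clippy --all-targets -- -D warnings"),
+    ("go.mod", "golangci-lint run"),
+)
+
+
+def pick_lint_command(project_dir: Path, env: Mapping[str, str] = os.environ) -> Optional[str]:
+    """Return the lint command to run, or None when no language is detected."""
+    override = env.get("HARNESS_LINT_CMD")
+    if override:  # step 1: the explicit override wins
+        return override
+    for marker, command in DEFAULT_LINTERS:  # step 2: first marker file found
+        if (project_dir / marker).is_file():
+            return command  # step 3: that language's default linter
+    return None
+```
+
+The table order decides which linter wins in a repository that contains more than one marker file, which is one reason to replace auto-detection with an explicit command.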
+
+**Customize it.** The auto-detection is a starting point. For most projects, replace the body of `lint.sh` with your specific linter invocation.
+
+## Agent-Friendly Output
+
+When configuring linters, prefer JSON output for agent consumption:
+
+```bash
+# Instead of:
+eslint src/
+
+# Use:
+eslint src/ --format json > .harness/lint-results.json
+
+# Agent can then:
+jq '.[] | .messages[] | select(.severity == 2)' .harness/lint-results.json
+```
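+
+When `jq` is not available, the same filter can be written in a few lines of Python. This is a sketch that assumes ESLint's standard JSON formatter output (a list of per-file results, each with `filePath` and `messages`); `collect_errors` is a hypothetical helper name:
+
+```python
+from typing import Any
+
+
+def collect_errors(results: list[dict[str, Any]]) -> list[str]:
+    """Return one 'path:line:col rule message' string per severity-2 message."""
+    errors = []
+    for file_result in results:
+        for msg in file_result.get("messages", []):
+            if msg.get("severity") == 2:  # 2 = error, 1 = warning
+                location = f"{file_result['filePath']}:{msg.get('line', 0)}:{msg.get('column', 0)}"
+                errors.append(f"{location} {msg.get('ruleId')} {msg['message']}")
+    return errors
+```
+
+Feed it the parsed contents of `.harness/lint-results.json` (for example via `json.load`) to get a flat list an agent can act on one entry at a time.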
+
+The verify loop can check lint results alongside runtime telemetry:
+
+```bash
+make lint                              # fast static check
+make test                              # integration tests
+jq 'select(.level == "ERROR")' .harness/logs.jsonl  # runtime errors
+```
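+
+The last step, filtering runtime errors out of a JSONL log, looks like this in Python. It is a sketch that assumes one JSON object per line with a `level` field, as in the `jq` filter above; `error_events` is a hypothetical name:
+
+```python
+import json
+from typing import Iterable, Iterator
+
+
+def error_events(lines: Iterable[str]) -> Iterator[dict]:
+    """Yield parsed JSONL records whose level is ERROR, skipping blank lines."""
+    for line in lines:
+        line = line.strip()
+        if not line:
+            continue
+        record = json.loads(line)
+        if record.get("level") == "ERROR":
+            yield record
+```
+
+Passing an open file handle for `.harness/logs.jsonl` works directly, since iterating a text file yields its lines.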
+
+## Typecheck vs Lint
+
+These are separate concerns:
+- **Lint** = style, patterns, common mistakes → `make lint`
+- **Typecheck** = type correctness → `make typecheck`
+
+Both run under `make check` (before tests). Keep them separate so agents know which kind of error they're dealing with.
+
+## Pinning Versions
+
+Lock linter versions in your package manager (package.json, pyproject.toml, go.mod). Never rely on globally installed linters in CI — the version will drift. Agent harness runs must be reproducible.
diff --git a/.github/skills/agent-native/references/wizard-cli.md b/.github/skills/agent-native/references/wizard-cli.md
new file mode 100644
index 0000000..2d508b1
--- /dev/null
+++ b/.github/skills/agent-native/references/wizard-cli.md
@@ -0,0 +1,67 @@
+# Wizard CLI
+
+This skill ships a Typer-based CLI wizard:
+
+- `scripts/harness_wizard.py`
+
+Use it as the primary interface for bootstrapping and evolving repositories.
+
+## Quick Start
+
+```bash
+python3 scripts/harness_wizard.py init <repo-path> --profile control
+python3 scripts/harness_wizard.py status <repo-path>
+python3 scripts/harness_wizard.py audit <repo-path>
+```
+
+## Commands
+
+### `init`
+
+Initialize a repository with harness and control-system structures.
+
+```bash
+python3 scripts/harness_wizard.py init <repo-path> --profile baseline
+python3 scripts/harness_wizard.py init <repo-path> --profile control
+python3 scripts/harness_wizard.py init <repo-path> --profile full
+python3 scripts/harness_wizard.py init <repo-path> --profile full --force
+```
+
+Profiles:
+
+- `baseline`: AGENTS/PLANS/docs/Makefile/scripts/CI core harness.
+- `control`: baseline + control primitives (`docs/control/*`, metrics yaml).
+- `full`: control + entropy controls (`entropy_check.sh`, nightly workflow).
+
+### `audit`
+
+Run the baseline harness audit wrapper:
+
+```bash
+python3 scripts/harness_wizard.py audit <repo-path>
+```
+
+### `status`
+
+Show baseline and primitive coverage:
+
+```bash
+python3 scripts/harness_wizard.py status <repo-path>
+```
+
+### `primitive list`
+
+List all available control primitives and associated files:
+
+```bash
+python3 scripts/harness_wizard.py primitive list
+```
+
+### `primitive add`
+
+Add selected primitives to an existing repo incrementally:
+
+```bash
+python3 scripts/harness_wizard.py primitive add setpoint sensors --repo <repo-path>
+python3 scripts/harness_wizard.py primitive add entropy --repo <repo-path>
+```
diff --git a/.github/skills/agent-native/scripts/audit_harness.sh b/.github/skills/agent-native/scripts/audit_harness.sh
new file mode 100755
index 0000000..737dca5
--- /dev/null
+++ b/.github/skills/agent-native/scripts/audit_harness.sh
@@ -0,0 +1,95 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+usage() {
+  cat <<'EOF'
+Usage: audit_harness.sh [repo_path]
+
+Audit a repository for baseline harness engineering artifacts.
+EOF
+}
+
+target_path="${1:-.}"
+if [ "$target_path" = "-h" ] || [ "$target_path" = "--help" ]; then
+  usage
+  exit 0
+fi
+
+if [ ! -d "$target_path" ]; then
+  echo "error: target path does not exist: $target_path" >&2
+  exit 1
+fi
+
+target_path=$(cd "$target_path" && pwd)
+failures=0
+
+ok() {
+  echo "[ok]      $1"
+}
+
+fail() {
+  echo "[missing] $1"
+  failures=$((failures + 1))
+}
+
+check_file() {
+  local relative="$1"
+  if [ -f "$target_path/$relative" ]; then
+    ok "$relative"
+  else
+    fail "$relative"
+  fi
+}
+
+check_contains() {
+  local relative="$1"
+  local pattern="$2"
+  local label="$3"
+  local full="$target_path/$relative"
+
+  if [ ! -f "$full" ]; then
+    fail "$label (file missing: $relative)"
+    return
+  fi
+
+  if grep -Eq "$pattern" "$full"; then
+    ok "$label"
+  else
+    fail "$label"
+  fi
+}
+
+echo "Auditing harness artifacts in: $target_path"
+echo
+
+check_file "AGENTS.md"
+check_file "PLANS.md"
+check_file "docs/ARCHITECTURE.md"
+check_file "docs/OBSERVABILITY.md"
+check_file "Makefile.harness"
+check_file "scripts/audit_harness.sh"
+check_file "scripts/harness/smoke.sh"
+check_file "scripts/harness/test.sh"
+check_file "scripts/harness/lint.sh"
+check_file "scripts/harness/typecheck.sh"
+check_file ".github/workflows/harness.yml"
+
+echo
+check_contains "AGENTS.md" "Harness Commands" "AGENTS.md: Harness Commands section"
+check_contains "AGENTS.md" "Execution Plans" "AGENTS.md: Execution Plans section"
+check_contains "docs/ARCHITECTURE.md" "Boundaries" "ARCHITECTURE.md: boundary guidance"
+check_contains "docs/OBSERVABILITY.md" "Required Event Fields" "OBSERVABILITY.md: required fields"
+check_contains "Makefile.harness" "^smoke:" "Makefile.harness: smoke target"
+check_contains "Makefile.harness" "^test:" "Makefile.harness: test target"
+check_contains "Makefile.harness" "^lint:" "Makefile.harness: lint target"
+check_contains "Makefile.harness" "^typecheck:" "Makefile.harness: typecheck target"
+check_contains "Makefile.harness" "^ci:" "Makefile.harness: ci target"
+check_contains ".github/workflows/harness.yml" "make ci" "CI workflow executes make ci"
+
+echo
+if [ "$failures" -gt 0 ]; then
+  echo "Harness audit failed: $failures issue(s) detected."
+  exit 1
+fi
+
+echo "Harness audit passed."
diff --git a/.github/skills/agent-native/scripts/bootstrap_harness.sh b/.github/skills/agent-native/scripts/bootstrap_harness.sh
new file mode 100755
index 0000000..79573a4
--- /dev/null
+++ b/.github/skills/agent-native/scripts/bootstrap_harness.sh
@@ -0,0 +1,182 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+usage() {
+  cat <<'EOF'
+Usage: bootstrap_harness.sh [repo_path] [--force]
+
+Install harness templates into a target repository.
+
+Arguments:
+  repo_path   Target repository path (default: current directory)
+  --force     Overwrite existing template-managed files
+EOF
+}
+
+target_path="."
+force=0
+
+while [ $# -gt 0 ]; do
+  case "$1" in
+    --force)
+      force=1
+      ;;
+    -h|--help)
+      usage
+      exit 0
+      ;;
+    *)
+      if [ "$target_path" != "." ]; then
+        echo "error: multiple repo paths provided" >&2
+        usage
+        exit 1
+      fi
+      target_path="$1"
+      ;;
+  esac
+  shift
+done
+
+if [ ! -d "$target_path" ]; then
+  echo "error: target path does not exist: $target_path" >&2
+  exit 1
+fi
+
+target_path=$(cd "$target_path" && pwd)
+script_dir=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)
+skill_dir=$(cd "$script_dir/.." && pwd)
+template_dir="$skill_dir/assets/templates"
+
+if [ ! -d "$template_dir" ]; then
+  echo "error: template directory missing: $template_dir" >&2
+  exit 1
+fi
+
+copy_template() {
+  local relative="$1"
+  local source="$template_dir/$relative"
+  local destination="$target_path/$relative"
+
+  if [ ! -f "$source" ]; then
+    echo "[error] missing template: $relative" >&2
+    exit 1
+  fi
+
+  mkdir -p "$(dirname "$destination")"
+
+  if [ -f "$destination" ] && [ "$force" -ne 1 ]; then
+    echo "[skip]  $relative (exists)"
+    return 0
+  fi
+
+  cp "$source" "$destination"
+  echo "[write] $relative"
+}
+
+templates=(
+  "AGENTS.md"
+  "PLANS.md"
+  "docs/ARCHITECTURE.md"
+  "docs/OBSERVABILITY.md"
+  "Makefile.harness"
+  "scripts/audit_harness.sh"
+  "scripts/verify_customized.sh"
+  "scripts/harness/setup.sh"
+  "scripts/harness/format.sh"
+  "scripts/harness/smoke.sh"
+  "scripts/harness/test.sh"
+  "scripts/harness/lint.sh"
+  "scripts/harness/typecheck.sh"
+  ".github/workflows/harness.yml"
+)
+
+for relative in "${templates[@]}"; do
+  copy_template "$relative"
+done
+
+makefile="$target_path/Makefile"
+if [ ! -f "$makefile" ]; then
+  cat > "$makefile" <<'EOF'
+-include Makefile.harness
+EOF
+  echo "[write] Makefile"
+elif ! grep -Eq '(^|[[:space:]])-?include[[:space:]]+Makefile\.harness([[:space:]]|$)' "$makefile"; then
+  cat >> "$makefile" <<'EOF'
+
+# Harness engineering targets
+-include Makefile.harness
+EOF
+  echo "[update] Makefile (+ include Makefile.harness)"
+else
+  echo "[skip]  Makefile already includes Makefile.harness"
+fi
+
+chmod +x \
+  "$target_path/scripts/audit_harness.sh" \
+  "$target_path/scripts/verify_customized.sh" \
+  "$target_path/scripts/harness/setup.sh" \
+  "$target_path/scripts/harness/format.sh" \
+  "$target_path/scripts/harness/smoke.sh" \
+  "$target_path/scripts/harness/test.sh" \
+  "$target_path/scripts/harness/lint.sh" \
+  "$target_path/scripts/harness/typecheck.sh"
+
+# Create .harness/ observability directory and jq query library
+mkdir -p "$target_path/.harness/queries"
+
+# Seed empty JSONL files
+for f in logs.jsonl traces.jsonl metrics.jsonl; do
+  touch "$target_path/.harness/$f"
+done
+
+# Write jq query library
+cat > "$target_path/.harness/queries/errors.jq" <<'JQ'
+# Recent errors
+select(.level == "ERROR")
+JQ
+
+cat > "$target_path/.harness/queries/problems.jq" <<'JQ'
+# All problems (4xx warnings + 5xx errors)
+select(.level == "ERROR" or .level == "WARN")
+JQ
+
+cat > "$target_path/.harness/queries/slow.jq" <<'JQ'
+# Slow requests — usage: jq --argjson threshold 500 -f slow.jq traces.jsonl
+select(.duration_ms > $threshold)
+JQ
+
+cat > "$target_path/.harness/queries/trace.jq" <<'JQ'
+# All events for a trace — usage: jq --arg tid <id> -f trace.jq logs.jsonl traces.jsonl
+select(.trace_id == $tid)
+JQ
+
+cat > "$target_path/.harness/queries/summary.jq" <<'JQ'
+# Error summary by service (needs slurp); usage: jq -s -f summary.jq logs.jsonl
+map(select(.level == "ERROR")) | group_by(.service) | map({service: .[0].service, count: length})
+JQ
+
+echo "[write] .harness/ (observability: JSONL files + jq query library)"
+
+# Add .harness/ to .gitignore if not already there
+gitignore="$target_path/.gitignore"
+if [ -f "$gitignore" ]; then
+  if ! grep -qF '.harness/' "$gitignore"; then
+    echo '.harness/' >> "$gitignore"
+    echo "[update] .gitignore (+ .harness/)"
+  fi
+else
+  echo '.harness/' > "$gitignore"
+  echo "[write] .gitignore"
+fi
+
+echo
+echo "Bootstrap complete."
+echo "Next:"
+echo "  1) Read AGENTS.md — it tells you what to customize and what conventions to follow"
+echo "  2) Read docs/OBSERVABILITY.md — follow the Level Policy (2xx=INFO, 4xx=WARN, 5xx=ERROR)"
+echo "  3) Customize docs/ARCHITECTURE.md — replace ALL placeholders with real project info"
+echo "  4) Fill in scripts/harness/*.sh with real project commands"
+echo "  5) Add hlog() calls to your app (see language examples in docs/OBSERVABILITY.md)"
+echo "  6) Verify: make -f Makefile.harness ci"
+echo "  7) Verify customization: scripts/verify_customized.sh ."
+echo "  8) Audit: scripts/audit_harness.sh ."
diff --git a/.github/skills/agent-native/scripts/harness_wizard.py b/.github/skills/agent-native/scripts/harness_wizard.py
new file mode 100755
index 0000000..a0056ad
--- /dev/null
+++ b/.github/skills/agent-native/scripts/harness_wizard.py
@@ -0,0 +1,228 @@
+#!/usr/bin/env python3
+"""Harness Engineering wizard CLI.
+
+This command-line tool bootstraps and upgrades repositories so they are
+harness/control-system ready for autonomous agent workflows.
+"""
+
+from __future__ import annotations
+
+import subprocess
+from enum import Enum
+from pathlib import Path
+from typing import Dict, Iterable, List, Tuple
+
+import typer
+
+app = typer.Typer(help="Harness engineering wizard for repository setup and control primitives.")
+primitive_app = typer.Typer(help="Add or inspect control-system primitives.")
+app.add_typer(primitive_app, name="primitive")
+
+
+SCRIPT_DIR = Path(__file__).resolve().parent
+SKILL_DIR = SCRIPT_DIR.parent
+TEMPLATE_DIR = SKILL_DIR / "assets" / "templates"
+BOOTSTRAP_SCRIPT = SCRIPT_DIR / "bootstrap_harness.sh"
+AUDIT_SCRIPT = SCRIPT_DIR / "audit_harness.sh"
+
+BASELINE_FILES: Tuple[str, ...] = (
+    "AGENTS.md",
+    "PLANS.md",
+    "docs/ARCHITECTURE.md",
+    "docs/OBSERVABILITY.md",
+    "Makefile.harness",
+    "scripts/audit_harness.sh",
+    "scripts/harness/smoke.sh",
+    "scripts/harness/test.sh",
+    "scripts/harness/lint.sh",
+    "scripts/harness/typecheck.sh",
+    ".github/workflows/harness.yml",
+)
+
+
+class Primitive(str, Enum):
+    loop = "loop"
+    setpoint = "setpoint"
+    sensors = "sensors"
+    controller = "controller"
+    actuators = "actuators"
+    feedback = "feedback"
+    stability = "stability"
+    entropy = "entropy"
+
+
+class Profile(str, Enum):
+    baseline = "baseline"
+    control = "control"
+    full = "full"
+
+
+PRIMITIVE_FILES: Dict[Primitive, Tuple[str, ...]] = {
+    Primitive.loop: ("docs/control/CONTROL_SYSTEM.md",),
+    Primitive.setpoint: (
+        "docs/control/SETPOINTS.md",
+        "evals/control-loop-metrics.yaml",
+    ),
+    Primitive.sensors: ("docs/control/SENSORS.md",),
+    Primitive.controller: ("docs/control/CONTROLLER.md",),
+    Primitive.actuators: ("docs/control/ACTUATORS.md",),
+    Primitive.feedback: ("docs/control/FEEDBACK_LOOP.md",),
+    Primitive.stability: ("docs/control/STABILITY.md",),
+    Primitive.entropy: (
+        "docs/control/ENTROPY.md",
+        "scripts/harness/entropy_check.sh",
+        ".github/workflows/nightly-harness-audit.yml",
+    ),
+}
+
+CONTROL_PROFILE: Tuple[Primitive, ...] = (
+    Primitive.loop,
+    Primitive.setpoint,
+    Primitive.sensors,
+    Primitive.controller,
+    Primitive.actuators,
+    Primitive.feedback,
+    Primitive.stability,
+)
+
+FULL_PROFILE: Tuple[Primitive, ...] = CONTROL_PROFILE + (Primitive.entropy,)
+
+
+def _resolve_repo(repo_path: Path) -> Path:
+    repo = repo_path.expanduser().resolve()
+    if not repo.exists() or not repo.is_dir():
+        typer.secho(f"error: repo path does not exist: {repo}", fg=typer.colors.RED, err=True)
+        raise typer.Exit(code=2)
+    return repo
+
+
+def _run(script: Path, args: List[str]) -> None:
+    if not script.exists():
+        typer.secho(f"error: script not found: {script}", fg=typer.colors.RED, err=True)
+        raise typer.Exit(code=2)
+    command = [str(script), *args]
+    result = subprocess.run(command, check=False)
+    if result.returncode != 0:
+        raise typer.Exit(code=result.returncode)
+
+
+def _copy_template(relative_path: str, repo: Path, force: bool) -> str:
+    source = TEMPLATE_DIR / relative_path
+    target = repo / relative_path
+    if not source.exists():
+        typer.secho(f"error: missing template file: {source}", fg=typer.colors.RED, err=True)
+        raise typer.Exit(code=2)
+
+    target.parent.mkdir(parents=True, exist_ok=True)
+    if target.exists() and not force:
+        return "skip"
+
+    target.write_bytes(source.read_bytes())
+    if target.suffix == ".sh":
+        target.chmod(0o755)
+    return "write"
+
+
+def _add_primitives(repo: Path, primitives: Iterable[Primitive], force: bool) -> None:
+    for primitive in primitives:
+        typer.secho(f"\n[{primitive.value}]")
+        for relative_path in PRIMITIVE_FILES[primitive]:
+            state = _copy_template(relative_path, repo, force)
+            verb = "write" if state == "write" else "skip "
+            typer.echo(f"  [{verb}] {relative_path}")
+
+
+def _check_exists(repo: Path, relative_path: str) -> bool:
+    return (repo / relative_path).exists()
+
+
+def _primitive_status(repo: Path, primitive: Primitive) -> Tuple[int, int]:
+    files = PRIMITIVE_FILES[primitive]
+    present = sum(1 for rel in files if _check_exists(repo, rel))
+    return present, len(files)
+
+
+@app.command()
+def init(
+    repo_path: Path = typer.Argument(Path("."), help="Target repository path."),
+    profile: Profile = typer.Option(
+        Profile.control,
+        "--profile",
+        "-p",
+        help="Setup profile: baseline, control, or full.",
+    ),
+    force: bool = typer.Option(False, "--force", help="Overwrite existing files."),
+) -> None:
+    """Initialize harness artifacts and optionally apply control primitives."""
+    repo = _resolve_repo(repo_path)
+    typer.secho(f"Initializing harness in {repo}", fg=typer.colors.CYAN)
+
+    args = [str(repo)]
+    if force:
+        args.append("--force")
+    _run(BOOTSTRAP_SCRIPT, args)
+
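+    # Baseline only runs the bootstrap script; other profiles also copy primitives.
+    if profile != Profile.baseline:
+        primitives = CONTROL_PROFILE if profile == Profile.control else FULL_PROFILE
+        _add_primitives(repo, primitives, force=force)
+
+    # Report completion for every profile, including baseline.
+    typer.secho("\nInitialization complete.", fg=typer.colors.GREEN)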
+
+
+@app.command()
+def audit(
+    repo_path: Path = typer.Argument(Path("."), help="Target repository path."),
+) -> None:
+    """Run baseline harness audit."""
+    repo = _resolve_repo(repo_path)
+    _run(AUDIT_SCRIPT, [str(repo)])
+
+
+@app.command()
+def status(
+    repo_path: Path = typer.Argument(Path("."), help="Target repository path."),
+) -> None:
+    """Show harness and primitive coverage status."""
+    repo = _resolve_repo(repo_path)
+    typer.secho(f"Harness status for {repo}", fg=typer.colors.CYAN)
+    typer.echo()
+
+    baseline_present = sum(1 for rel in BASELINE_FILES if _check_exists(repo, rel))
+    baseline_total = len(BASELINE_FILES)
+    typer.echo(f"baseline: {baseline_present}/{baseline_total}")
+    for rel in BASELINE_FILES:
+        mark = "OK " if _check_exists(repo, rel) else "MISS"
+        typer.echo(f"  [{mark}] {rel}")
+
+    typer.echo()
+    typer.echo("control primitives:")
+    for primitive in Primitive:
+        present, total = _primitive_status(repo, primitive)
+        mark = "OK " if present == total else "PARTIAL" if present > 0 else "MISS"
+        typer.echo(f"  [{mark}] {primitive.value}: {present}/{total}")
+
+
+@primitive_app.command("list")
+def primitive_list() -> None:
+    """List available control primitives and their files."""
+    for primitive in Primitive:
+        typer.echo(f"{primitive.value}")
+        for rel in PRIMITIVE_FILES[primitive]:
+            typer.echo(f"  - {rel}")
+
+
+@primitive_app.command("add")
+def primitive_add(
+    primitives: List[Primitive] = typer.Argument(..., help="Primitive names to add."),
+    repo_path: Path = typer.Option(Path("."), "--repo", "-r", help="Target repository path."),
+    force: bool = typer.Option(False, "--force", help="Overwrite existing primitive files."),
+) -> None:
+    """Add specific control primitives to a repository."""
+    repo = _resolve_repo(repo_path)
+    _add_primitives(repo, primitives, force=force)
+    typer.secho("\nPrimitive update complete.", fg=typer.colors.GREEN)
+
+
+if __name__ == "__main__":
+    app()
diff --git a/.github/skills/agent-native/scripts/verify_customized.sh b/.github/skills/agent-native/scripts/verify_customized.sh
new file mode 100755
index 0000000..988fa21
--- /dev/null
+++ b/.github/skills/agent-native/scripts/verify_customized.sh
@@ -0,0 +1,172 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+# VERIFY CUSTOMIZED — Checks that template boilerplate has been replaced.
+#
+# Run after setting up harness engineering to catch leftover placeholders.
+# Exit code 0 = all customized, 1 = boilerplate detected.
+
+target_path="${1:-.}"
+target_path=$(cd "$target_path" && pwd)
+
+failures=0
+total=0
+
+pass() {
+  total=$((total + 1))
+  echo "  ✅ $1"
+}
+
+fail() {
+  total=$((total + 1))
+  failures=$((failures + 1))
+  echo "  ❌ $1"
+}
+
+check_no_placeholders() {
+  local file="$1"
+  local label="$2"
+  local full="$target_path/$file"
+
+  if [ ! -f "$full" ]; then
+    fail "$label — file missing"
+    return
+  fi
+
+  # Check for common template placeholder patterns
+  if grep -qE '<project-name>|<runtime>|<entrypoints>|_Replace:|_Replace_' "$full"; then
+    fail "$label — still contains template placeholders"
+    echo "       $(grep -nE '<project-name>|<runtime>|<entrypoints>|_Replace:|_Replace_' "$full" | head -3)"
+    return
+  fi
+
+  pass "$label"
+}
+
+check_not_empty_template() {
+  local file="$1"
+  local label="$2"
+  local min_lines="${3:-10}"
+  local full="$target_path/$file"
+
+  if [ ! -f "$full" ]; then
+    fail "$label — file missing"
+    return
+  fi
+
+  local lines
+  lines=$(wc -l < "$full" | tr -d ' ')
+  if [ "$lines" -lt "$min_lines" ]; then
+    fail "$label — only $lines lines (expected ≥$min_lines, likely not customized)"
+    return
+  fi
+
+  pass "$label ($lines lines)"
+}
+
+check_plans_customized() {
+  local full="$target_path/PLANS.md"
+
+  if [ ! -f "$full" ]; then
+    fail "PLANS.md — file missing"
+    return
+  fi
+
+  # Check if it's still the raw template (contains multiple unchecked placeholders)
+  local placeholder_count
+  placeholder_count=$(grep -cE '^\s*-\s*\[ \]|<describe|<list|_TBD_|_TODO_' "$full" 2>/dev/null || true)
+  placeholder_count="${placeholder_count:-0}"
+  placeholder_count=$(echo "$placeholder_count" | tr -d '[:space:]')
+  local total_lines
+  total_lines=$(wc -l < "$full" | tr -d ' ')
+
+  if [ "$placeholder_count" -gt 3 ] && [ "$total_lines" -lt 30 ]; then
+    fail "PLANS.md — appears to be raw template ($placeholder_count placeholders in $total_lines lines)"
+    return
+  fi
+
+  pass "PLANS.md customized"
+}
+
+check_smoke_is_real() {
+  local full="$target_path/scripts/harness/smoke.sh"
+
+  if [ ! -f "$full" ]; then
+    fail "smoke.sh — file missing"
+    return
+  fi
+
+  # A real smoke script should have curl or wget or a health check
+  if grep -qE 'curl|wget|health|localhost|127\.0\.0\.1|START_CMD' "$full"; then
+    pass "smoke.sh — contains server lifecycle logic"
+  else
+    fail "smoke.sh — no server start/health check detected (might be a build-only stub)"
+  fi
+}
+
+check_tests_exist() {
+  local test_count=0
+
+  # Python
+  test_count=$((test_count + $(find "$target_path" -name "test_*.py" -o -name "*_test.py" 2>/dev/null | wc -l | tr -d ' ')))
+  # TypeScript/JS
+  test_count=$((test_count + $(find "$target_path" -name "*.test.ts" -o -name "*.spec.ts" -o -name "*.test.js" -o -name "*.spec.js" 2>/dev/null | wc -l | tr -d ' ')))
+  # Go
+  test_count=$((test_count + $(find "$target_path" -name "*_test.go" 2>/dev/null | wc -l | tr -d ' ')))
+  # C#
+  test_count=$((test_count + $(find "$target_path" -name "*Test*.cs" -o -name "*Tests*.cs" 2>/dev/null | wc -l | tr -d ' ')))
+
+  if [ "$test_count" -gt 0 ]; then
+    pass "Test files found ($test_count files)"
+  else
+    fail "No test files found"
+  fi
+}
+
+check_observability_wired() {
+  # Check if hlog/htrace/HLog/HTrace/Harness.Log/harness_log appears in source
+  local hits
+  hits=$(grep -rlE 'hlog|htrace|HLog|HTrace|Harness\w*\.(Log|Trace)|harness_log|harness_trace|HarnessTelemetry|HarnessLogger|\.harness/logs' "$target_path" \
+    --include="*.py" --include="*.ts" --include="*.js" --include="*.go" --include="*.cs" 2>/dev/null | \
+    grep -v node_modules | grep -v ".harness/" | wc -l | tr -d ' ')
+
+  if [ "$hits" -gt 0 ]; then
+    pass "Observability wired in source ($hits files)"
+  else
+    fail "No hlog()/htrace() calls found in source code"
+  fi
+}
+
+echo "🔍 Verifying harness customization in: $target_path"
+echo ""
+
+echo "📄 Template Placeholders:"
+check_no_placeholders "AGENTS.md" "AGENTS.md"
+check_no_placeholders "docs/ARCHITECTURE.md" "docs/ARCHITECTURE.md"
+check_no_placeholders "docs/OBSERVABILITY.md" "docs/OBSERVABILITY.md"
+echo ""
+
+echo "📋 Content Depth:"
+check_not_empty_template "AGENTS.md" "AGENTS.md" 15
+check_not_empty_template "docs/ARCHITECTURE.md" "docs/ARCHITECTURE.md" 20
+check_not_empty_template "docs/OBSERVABILITY.md" "docs/OBSERVABILITY.md" 15
+check_plans_customized
+echo ""
+
+echo "🔧 Harness Scripts:"
+check_smoke_is_real
+check_tests_exist
+echo ""
+
+echo "📊 Observability:"
+check_observability_wired
+echo ""
+
+echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
+if [ "$failures" -gt 0 ]; then
+  echo "❌ Verification failed: $failures/$total checks failed."
+  echo "   Fix the issues above before considering the harness complete."
+  exit 1
+fi
+
+echo "✅ All $total checks passed — harness is customized and wired."
diff --git a/.github/skills/react-best-practices/AGENTS.md b/.github/skills/react-best-practices/AGENTS.md
new file mode 100644
index 0000000..3bdafa1
--- /dev/null
+++ b/.github/skills/react-best-practices/AGENTS.md
@@ -0,0 +1,3254 @@
+# React Best Practices
+
+**Version 1.0.0**  
+Vercel Engineering  
+January 2026
+
+> **Note:**  
+> This document is mainly for agents and LLMs to follow when maintaining,  
+> generating, or refactoring React and Next.js codebases. Humans  
+> may also find it useful, but guidance here is optimized for automation  
+> and consistency by AI-assisted workflows.
+
+---
+
+## Abstract
+
+Comprehensive performance optimization guide for React and Next.js applications, designed for AI agents and LLMs. Contains 40+ rules across 8 categories, prioritized by impact from critical (eliminating waterfalls, reducing bundle size) to incremental (advanced patterns). Each rule includes detailed explanations, real-world examples comparing incorrect vs. correct implementations, and specific impact metrics to guide automated refactoring and code generation.
+
+---
+
+## Table of Contents
+
+1. [Eliminating Waterfalls](#1-eliminating-waterfalls) — **CRITICAL**
+   - 1.1 [Defer Await Until Needed](#11-defer-await-until-needed)
+   - 1.2 [Dependency-Based Parallelization](#12-dependency-based-parallelization)
+   - 1.3 [Prevent Waterfall Chains in API Routes](#13-prevent-waterfall-chains-in-api-routes)
+   - 1.4 [Promise.all() for Independent Operations](#14-promiseall-for-independent-operations)
+   - 1.5 [Strategic Suspense Boundaries](#15-strategic-suspense-boundaries)
+2. [Bundle Size Optimization](#2-bundle-size-optimization) — **CRITICAL**
+   - 2.1 [Avoid Barrel File Imports](#21-avoid-barrel-file-imports)
+   - 2.2 [Conditional Module Loading](#22-conditional-module-loading)
+   - 2.3 [Defer Non-Critical Third-Party Libraries](#23-defer-non-critical-third-party-libraries)
+   - 2.4 [Dynamic Imports for Heavy Components](#24-dynamic-imports-for-heavy-components)
+   - 2.5 [Preload Based on User Intent](#25-preload-based-on-user-intent)
+3. [Server-Side Performance](#3-server-side-performance) — **HIGH**
+   - 3.1 [Authenticate Server Actions Like API Routes](#31-authenticate-server-actions-like-api-routes)
+   - 3.2 [Avoid Duplicate Serialization in RSC Props](#32-avoid-duplicate-serialization-in-rsc-props)
+   - 3.3 [Cross-Request LRU Caching](#33-cross-request-lru-caching)
+   - 3.4 [Hoist Static I/O to Module Level](#34-hoist-static-io-to-module-level)
+   - 3.5 [Minimize Serialization at RSC Boundaries](#35-minimize-serialization-at-rsc-boundaries)
+   - 3.6 [Parallel Data Fetching with Component Composition](#36-parallel-data-fetching-with-component-composition)
+   - 3.7 [Per-Request Deduplication with React.cache()](#37-per-request-deduplication-with-reactcache)
+   - 3.8 [Use after() for Non-Blocking Operations](#38-use-after-for-non-blocking-operations)
+4. [Client-Side Data Fetching](#4-client-side-data-fetching) — **MEDIUM-HIGH**
+   - 4.1 [Deduplicate Global Event Listeners](#41-deduplicate-global-event-listeners)
+   - 4.2 [Use Passive Event Listeners for Scrolling Performance](#42-use-passive-event-listeners-for-scrolling-performance)
+   - 4.3 [Use SWR for Automatic Deduplication](#43-use-swr-for-automatic-deduplication)
+   - 4.4 [Version and Minimize localStorage Data](#44-version-and-minimize-localstorage-data)
+5. [Re-render Optimization](#5-re-render-optimization) — **MEDIUM**
+   - 5.1 [Calculate Derived State During Rendering](#51-calculate-derived-state-during-rendering)
+   - 5.2 [Defer State Reads to Usage Point](#52-defer-state-reads-to-usage-point)
+   - 5.3 [Do not wrap a simple expression with a primitive result type in useMemo](#53-do-not-wrap-a-simple-expression-with-a-primitive-result-type-in-usememo)
+   - 5.4 [Don't Define Components Inside Components](#54-dont-define-components-inside-components)
+   - 5.5 [Extract Default Non-primitive Parameter Value from Memoized Component to Constant](#55-extract-default-non-primitive-parameter-value-from-memoized-component-to-constant)
+   - 5.6 [Extract to Memoized Components](#56-extract-to-memoized-components)
+   - 5.7 [Narrow Effect Dependencies](#57-narrow-effect-dependencies)
+   - 5.8 [Put Interaction Logic in Event Handlers](#58-put-interaction-logic-in-event-handlers)
+   - 5.9 [Subscribe to Derived State](#59-subscribe-to-derived-state)
+   - 5.10 [Use Functional setState Updates](#510-use-functional-setstate-updates)
+   - 5.11 [Use Lazy State Initialization](#511-use-lazy-state-initialization)
+   - 5.12 [Use Transitions for Non-Urgent Updates](#512-use-transitions-for-non-urgent-updates)
+   - 5.13 [Use useRef for Transient Values](#513-use-useref-for-transient-values)
+6. [Rendering Performance](#6-rendering-performance) — **MEDIUM**
+   - 6.1 [Animate SVG Wrapper Instead of SVG Element](#61-animate-svg-wrapper-instead-of-svg-element)
+   - 6.2 [CSS content-visibility for Long Lists](#62-css-content-visibility-for-long-lists)
+   - 6.3 [Hoist Static JSX Elements](#63-hoist-static-jsx-elements)
+   - 6.4 [Optimize SVG Precision](#64-optimize-svg-precision)
+   - 6.5 [Prevent Hydration Mismatch Without Flickering](#65-prevent-hydration-mismatch-without-flickering)
+   - 6.6 [Suppress Expected Hydration Mismatches](#66-suppress-expected-hydration-mismatches)
+   - 6.7 [Use Activity Component for Show/Hide](#67-use-activity-component-for-showhide)
+   - 6.8 [Use defer or async on Script Tags](#68-use-defer-or-async-on-script-tags)
+   - 6.9 [Use Explicit Conditional Rendering](#69-use-explicit-conditional-rendering)
+   - 6.10 [Use React DOM Resource Hints](#610-use-react-dom-resource-hints)
+   - 6.11 [Use useTransition Over Manual Loading States](#611-use-usetransition-over-manual-loading-states)
+7. [JavaScript Performance](#7-javascript-performance) — **LOW-MEDIUM**
+   - 7.1 [Avoid Layout Thrashing](#71-avoid-layout-thrashing)
+   - 7.2 [Build Index Maps for Repeated Lookups](#72-build-index-maps-for-repeated-lookups)
+   - 7.3 [Cache Property Access in Loops](#73-cache-property-access-in-loops)
+   - 7.4 [Cache Repeated Function Calls](#74-cache-repeated-function-calls)
+   - 7.5 [Cache Storage API Calls](#75-cache-storage-api-calls)
+   - 7.6 [Combine Multiple Array Iterations](#76-combine-multiple-array-iterations)
+   - 7.7 [Early Length Check for Array Comparisons](#77-early-length-check-for-array-comparisons)
+   - 7.8 [Early Return from Functions](#78-early-return-from-functions)
+   - 7.9 [Hoist RegExp Creation](#79-hoist-regexp-creation)
+   - 7.10 [Use flatMap to Map and Filter in One Pass](#710-use-flatmap-to-map-and-filter-in-one-pass)
+   - 7.11 [Use Loop for Min/Max Instead of Sort](#711-use-loop-for-minmax-instead-of-sort)
+   - 7.12 [Use Set/Map for O(1) Lookups](#712-use-setmap-for-o1-lookups)
+   - 7.13 [Use toSorted() Instead of sort() for Immutability](#713-use-tosorted-instead-of-sort-for-immutability)
+8. [Advanced Patterns](#8-advanced-patterns) — **LOW**
+   - 8.1 [Initialize App Once, Not Per Mount](#81-initialize-app-once-not-per-mount)
+   - 8.2 [Store Event Handlers in Refs](#82-store-event-handlers-in-refs)
+   - 8.3 [useEffectEvent for Stable Callback Refs](#83-useeffectevent-for-stable-callback-refs)
+
+---
+
+## 1. Eliminating Waterfalls
+
+**Impact: CRITICAL**
+
+Waterfalls are the #1 performance killer. Each sequential await adds full network latency. Eliminating them yields the largest gains.
+
+### 1.1 Defer Await Until Needed
+
+**Impact: HIGH (avoids blocking unused code paths)**
+
+Move `await` operations into the branches where they're actually used to avoid blocking code paths that don't need them.
+
+**Incorrect: blocks both branches**
+
+```typescript
+async function handleRequest(userId: string, skipProcessing: boolean) {
+  const userData = await fetchUserData(userId)
+  
+  if (skipProcessing) {
+    // Returns immediately but still waited for userData
+    return { skipped: true }
+  }
+  
+  // Only this branch uses userData
+  return processUserData(userData)
+}
+```
+
+**Correct: only blocks when needed**
+
+```typescript
+async function handleRequest(userId: string, skipProcessing: boolean) {
+  if (skipProcessing) {
+    // Returns immediately without waiting
+    return { skipped: true }
+  }
+  
+  // Fetch only when needed
+  const userData = await fetchUserData(userId)
+  return processUserData(userData)
+}
+```
+
+**Another example: early return optimization**
+
+```typescript
+// Incorrect: always fetches permissions
+async function updateResource(resourceId: string, userId: string) {
+  const permissions = await fetchPermissions(userId)
+  const resource = await getResource(resourceId)
+  
+  if (!resource) {
+    return { error: 'Not found' }
+  }
+  
+  if (!permissions.canEdit) {
+    return { error: 'Forbidden' }
+  }
+  
+  return await updateResourceData(resource, permissions)
+}
+
+// Correct: fetches only when needed
+async function updateResource(resourceId: string, userId: string) {
+  const resource = await getResource(resourceId)
+  
+  if (!resource) {
+    return { error: 'Not found' }
+  }
+  
+  const permissions = await fetchPermissions(userId)
+  
+  if (!permissions.canEdit) {
+    return { error: 'Forbidden' }
+  }
+  
+  return await updateResourceData(resource, permissions)
+}
+```
+
+This optimization is especially valuable when the skipped branch is frequently taken, or when the deferred operation is expensive.
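The effect can be checked without a network. A minimal sketch, assuming a hypothetical `fetchUserData` stub that counts its own invocations (the counter and the return shapes are illustrative, not part of any real API):

```typescript
// Hypothetical stub standing in for a real network call; it counts invocations.
let fetchCalls = 0

async function fetchUserData(userId: string): Promise<{ id: string }> {
  fetchCalls++
  return { id: userId }
}

async function handleRequest(userId: string, skipProcessing: boolean) {
  if (skipProcessing) {
    // Early return: no fetch is started on this path
    return { skipped: true as const }
  }

  // The await lives in the only branch that needs the data
  const userData = await fetchUserData(userId)
  return { skipped: false as const, userData }
}
```

Calling `handleRequest('u1', true)` leaves `fetchCalls` at 0; only the non-skip path increments it.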
+
+### 1.2 Dependency-Based Parallelization
+
+**Impact: CRITICAL (2-10× improvement)**
+
+For operations with partial dependencies, use `better-all` to maximize parallelism. It automatically starts each task at the earliest possible moment.
+
+**Incorrect: profile waits for config unnecessarily**
+
+```typescript
+const [user, config] = await Promise.all([
+  fetchUser(),
+  fetchConfig()
+])
+const profile = await fetchProfile(user.id)
+```
+
+**Correct: config and profile run in parallel**
+
+```typescript
+import { all } from 'better-all'
+
+const { user, config, profile } = await all({
+  async user() { return fetchUser() },
+  async config() { return fetchConfig() },
+  async profile() {
+    return fetchProfile((await this.$.user).id)
+  }
+})
+```
+
+**Alternative without extra dependencies:**
+
+```typescript
+const userPromise = fetchUser()
+const profilePromise = userPromise.then(user => fetchProfile(user.id))
+
+const [user, config, profile] = await Promise.all([
+  userPromise,
+  fetchConfig(),
+  profilePromise
+])
+```
+
+Alternatively, create all the promises first and call `Promise.all()` once at the end.
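A minimal sketch of the "create all promises first" variant, assuming hypothetical `fetchUser`, `fetchConfig`, and `fetchProfile` stubs that record when they start (the `started` log is illustrative only):

```typescript
// Hypothetical stubs; each records when it starts so the ordering is visible.
const started: string[] = []

async function fetchUser() {
  started.push('user')
  return { id: 'u1' }
}

async function fetchConfig() {
  started.push('config')
  return { theme: 'dark' }
}

async function fetchProfile(userId: string) {
  started.push('profile')
  return { userId, bio: '' }
}

async function loadAll() {
  // 1. Create every promise up front; nothing is awaited yet
  const userPromise = fetchUser()
  const configPromise = fetchConfig()
  const profilePromise = userPromise.then(user => fetchProfile(user.id))

  // 2. Await them together at the end
  const [user, config, profile] = await Promise.all([
    userPromise,
    configPromise,
    profilePromise
  ])
  return { user, config, profile }
}
```

`profilePromise` is chained off `userPromise`, so the profile request starts as soon as the user resolves and does not wait for the config.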
+
+Reference: [https://github.com/shuding/better-all](https://github.com/shuding/better-all)
+
+### 1.3 Prevent Waterfall Chains in API Routes
+
+**Impact: CRITICAL (2-10× improvement)**
+
+In API routes and Server Actions, start independent operations immediately, even if you don't await them yet.
+
+**Incorrect: config waits for auth, data waits for both**
+
+```typescript
+export async function GET(request: Request) {
+  const session = await auth()
+  const config = await fetchConfig()
+  const data = await fetchData(session.user.id)
+  return Response.json({ data, config })
+}
+```
+
+**Correct: auth and config start immediately**
+
+```typescript
+export async function GET(request: Request) {
+  const sessionPromise = auth()
+  const configPromise = fetchConfig()
+  const session = await sessionPromise
+  const [config, data] = await Promise.all([
+    configPromise,
+    fetchData(session.user.id)
+  ])
+  return Response.json({ data, config })
+}
+```
+
+For operations with more complex dependency chains, use `better-all` to automatically maximize parallelism (see Dependency-Based Parallelization).
+
+### 1.4 Promise.all() for Independent Operations
+
+**Impact: CRITICAL (2-10× improvement)**
+
+When async operations have no interdependencies, execute them concurrently using `Promise.all()`.
+
+**Incorrect: sequential execution, 3 round trips**
+
+```typescript
+const user = await fetchUser()
+const posts = await fetchPosts()
+const comments = await fetchComments()
+```
+
+**Correct: parallel execution, 1 round trip**
+
+```typescript
+const [user, posts, comments] = await Promise.all([
+  fetchUser(),
+  fetchPosts(),
+  fetchComments()
+])
+```
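The difference between the two forms can be observed by counting overlapping requests. A small sketch, assuming a hypothetical `fakeFetch` helper that simulates latency with a timer and tracks how many calls are in flight at once:

```typescript
// Hypothetical tracker: counts how many simulated requests overlap.
let inFlight = 0
let maxInFlight = 0

async function fakeFetch<T>(value: T): Promise<T> {
  inFlight++
  maxInFlight = Math.max(maxInFlight, inFlight)
  await new Promise<void>(resolve => setTimeout(resolve, 5)) // simulated latency
  inFlight--
  return value
}

async function sequential() {
  maxInFlight = 0
  // Each call starts only after the previous one has finished
  const user = await fakeFetch('user')
  const posts = await fakeFetch('posts')
  const comments = await fakeFetch('comments')
  return { results: [user, posts, comments], maxInFlight }
}

async function parallel() {
  maxInFlight = 0
  // All three calls start before any of them is awaited
  const results = await Promise.all([
    fakeFetch('user'),
    fakeFetch('posts'),
    fakeFetch('comments')
  ])
  return { results, maxInFlight }
}
```

Under these assumptions the sequential version never has more than one request in flight, while the `Promise.all()` version has all three in flight at once.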
+
+### 1.5 Strategic Suspense Boundaries
+
+**Impact: HIGH (faster initial paint)**
+
+Instead of awaiting data in async components before returning JSX, use Suspense boundaries to show the wrapper UI faster while data loads.
+
+**Incorrect: wrapper blocked by data fetching**
+
+```tsx
+async function Page() {
+  const data = await fetchData() // Blocks entire page
+  
+  return (
+    <div>
+      <div>Sidebar</div>
+      <div>Header</div>
+      <div>
+        <DataDisplay data={data} />
+      </div>
+      <div>Footer</div>
+    </div>
+  )
+}
+```
+
+The entire layout waits for data even though only the middle section needs it.
+
+**Correct: wrapper shows immediately, data streams in**
+
+```tsx
+function Page() {
+  return (
+    <div>
+      <div>Sidebar</div>
+      <div>Header</div>
+      <div>
+        <Suspense fallback={<Skeleton />}>
+          <DataDisplay />
+        </Suspense>
+      </div>
+      <div>Footer</div>
+    </div>
+  )
+}
+
+async function DataDisplay() {
+  const data = await fetchData() // Only blocks this component
+  return <div>{data.content}</div>
+}
+```
+
+Sidebar, Header, and Footer render immediately. Only DataDisplay waits for data.
+
+**Alternative: share promise across components**
+
+```tsx
+function Page() {
+  // Start fetch immediately, but don't await
+  const dataPromise = fetchData()
+  
+  return (
+    <div>
+      <div>Sidebar</div>
+      <div>Header</div>
+      <Suspense fallback={<Skeleton />}>
+        <DataDisplay dataPromise={dataPromise} />
+        <DataSummary dataPromise={dataPromise} />
+      </Suspense>
+      <div>Footer</div>
+    </div>
+  )
+}
+
+function DataDisplay({ dataPromise }: { dataPromise: Promise<Data> }) {
+  const data = use(dataPromise) // Unwraps the promise
+  return <div>{data.content}</div>
+}
+
+function DataSummary({ dataPromise }: { dataPromise: Promise<Data> }) {
+  const data = use(dataPromise) // Reuses the same promise
+  return <div>{data.summary}</div>
+}
+```
+
+Both components share the same promise, so only one fetch occurs. Layout renders immediately while both components wait together.
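The sharing behaviour does not depend on React. A framework-free sketch, assuming a hypothetical `fetchData` stub with a call counter, and plain `await` standing in for `use()`:

```typescript
// Hypothetical data source that counts how often it is hit.
let fetchCount = 0

async function fetchData(): Promise<{ content: string; summary: string }> {
  fetchCount++
  return { content: 'full text', summary: 'short text' }
}

// Two independent consumers receive the promise instead of calling fetchData themselves
async function readContent(dataPromise: Promise<{ content: string }>) {
  return (await dataPromise).content
}

async function readSummary(dataPromise: Promise<{ summary: string }>) {
  return (await dataPromise).summary
}

async function renderBoth() {
  const dataPromise = fetchData() // started once, not awaited here
  const [content, summary] = await Promise.all([
    readContent(dataPromise),
    readSummary(dataPromise)
  ])
  return { content, summary, fetchCount }
}
```

Because both consumers await the same promise object, `fetchData` runs once no matter how many readers there are.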
+
+**When NOT to use this pattern:**
+
+- Critical data needed for layout decisions (affects positioning)
+
+- SEO-critical content above the fold
+
+- Small, fast queries where suspense overhead isn't worth it
+
+- When you want to avoid layout shift (loading → content jump)
+
+**Trade-off:** Faster initial paint vs potential layout shift. Choose based on your UX priorities.
+
+---
+
+## 2. Bundle Size Optimization
+
+**Impact: CRITICAL**
+
+Reducing initial bundle size improves Time to Interactive and Largest Contentful Paint.
+
+### 2.1 Avoid Barrel File Imports
+
+**Impact: CRITICAL (200-800ms import cost, slow builds)**
+
+Import directly from source files instead of barrel files to avoid loading thousands of unused modules. **Barrel files** are entry points that re-export multiple modules (e.g., `index.js` that does `export * from './module'`).
+
+Popular icon and component libraries can have **up to 10,000 re-exports** in their entry file. For many React packages, **it takes 200-800ms just to import them**, affecting both development speed and production cold starts.
+
+**Why tree-shaking doesn't help:** When a library is marked as external (not bundled), the bundler can't optimize it. If you bundle it to enable tree-shaking, builds become substantially slower because the bundler has to analyze the entire module graph.
+
+**Incorrect: imports entire library**
+
+```tsx
+import { Check, X, Menu } from 'lucide-react'
+// Loads 1,583 modules, takes ~2.8s extra in dev
+// Runtime cost: 200-800ms on every cold start
+
+import { Button, TextField } from '@mui/material'
+// Loads 2,225 modules, takes ~4.2s extra in dev
+```
+
+**Correct: imports only what you need**
+
+```tsx
+import Check from 'lucide-react/dist/esm/icons/check'
+import X from 'lucide-react/dist/esm/icons/x'
+import Menu from 'lucide-react/dist/esm/icons/menu'
+// Loads only 3 modules (~2KB vs ~1MB)
+
+import Button from '@mui/material/Button'
+import TextField from '@mui/material/TextField'
+// Loads only what you use
+```
+
+**Alternative: Next.js 13.5+**
+
+```js
+// next.config.js - use optimizePackageImports
+module.exports = {
+  experimental: {
+    optimizePackageImports: ['lucide-react', '@mui/material']
+  }
+}
+
+// Then you can keep the ergonomic barrel imports:
+import { Check, X, Menu } from 'lucide-react'
+// Automatically transformed to direct imports at build time
+```
+
+Direct imports provide 15-70% faster dev boot, 28% faster builds, 40% faster cold starts, and significantly faster HMR.
+
+Libraries commonly affected: `lucide-react`, `@mui/material`, `@mui/icons-material`, `@tabler/icons-react`, `react-icons`, `@headlessui/react`, `@radix-ui/react-*`, `lodash`, `ramda`, `date-fns`, `rxjs`, `react-use`.
+
+Reference: [https://vercel.com/blog/how-we-optimized-package-imports-in-next-js](https://vercel.com/blog/how-we-optimized-package-imports-in-next-js)
+
+### 2.2 Conditional Module Loading
+
+**Impact: HIGH (loads large data only when needed)**
+
+Load large data or modules only when a feature is activated.
+
+**Example: lazy-load animation frames**
+
+```tsx
+function AnimationPlayer({ enabled, setEnabled }: { enabled: boolean; setEnabled: React.Dispatch<React.SetStateAction<boolean>> }) {
+  const [frames, setFrames] = useState<Frame[] | null>(null)
+
+  useEffect(() => {
+    if (enabled && !frames && typeof window !== 'undefined') {
+      import('./animation-frames.js')
+        .then(mod => setFrames(mod.frames))
+        .catch(() => setEnabled(false))
+    }
+  }, [enabled, frames, setEnabled])
+
+  if (!frames) return <Skeleton />
+  return <Canvas frames={frames} />
+}
+```
+
+The `typeof window !== 'undefined'` check prevents bundling this module for SSR, optimizing server bundle size and build speed.
+
+### 2.3 Defer Non-Critical Third-Party Libraries
+
+**Impact: MEDIUM (loads after hydration)**
+
+Analytics, logging, and error tracking don't block user interaction. Load them after hydration.
+
+**Incorrect: blocks initial bundle**
+
+```tsx
+import { Analytics } from '@vercel/analytics/react'
+
+export default function RootLayout({ children }) {
+  return (
+    <html>
+      <body>
+        {children}
+        <Analytics />
+      </body>
+    </html>
+  )
+}
+```
+
+**Correct: loads after hydration**
+
+```tsx
+import dynamic from 'next/dynamic'
+
+const Analytics = dynamic(
+  () => import('@vercel/analytics/react').then(m => m.Analytics),
+  { ssr: false }
+)
+
+export default function RootLayout({ children }) {
+  return (
+    <html>
+      <body>
+        {children}
+        <Analytics />
+      </body>
+    </html>
+  )
+}
+```
+
+### 2.4 Dynamic Imports for Heavy Components
+
+**Impact: CRITICAL (directly affects TTI and LCP)**
+
+Use `next/dynamic` to lazy-load large components not needed on initial render.
+
+**Incorrect: Monaco is bundled into the main chunk (~300KB)**
+
+```tsx
+import { MonacoEditor } from './monaco-editor'
+
+function CodePanel({ code }: { code: string }) {
+  return <MonacoEditor value={code} />
+}
+```
+
+**Correct: Monaco loads on demand**
+
+```tsx
+import dynamic from 'next/dynamic'
+
+const MonacoEditor = dynamic(
+  () => import('./monaco-editor').then(m => m.MonacoEditor),
+  { ssr: false }
+)
+
+function CodePanel({ code }: { code: string }) {
+  return <MonacoEditor value={code} />
+}
+```
+
+### 2.5 Preload Based on User Intent
+
+**Impact: MEDIUM (reduces perceived latency)**
+
+Preload heavy bundles before they're needed to reduce perceived latency.
+
+**Example: preload on hover/focus**
+
+```tsx
+function EditorButton({ onClick }: { onClick: () => void }) {
+  const preload = () => {
+    if (typeof window !== 'undefined') {
+      void import('./monaco-editor')
+    }
+  }
+
+  return (
+    <button
+      onMouseEnter={preload}
+      onFocus={preload}
+      onClick={onClick}
+    >
+      Open Editor
+    </button>
+  )
+}
+```
+
+**Example: preload when feature flag is enabled**
+
+```tsx
+function FlagsProvider({ children, flags }: Props) {
+  useEffect(() => {
+    if (flags.editorEnabled && typeof window !== 'undefined') {
+      void import('./monaco-editor').then(mod => mod.init())
+    }
+  }, [flags.editorEnabled])
+
+  return <FlagsContext.Provider value={flags}>
+    {children}
+  </FlagsContext.Provider>
+}
+```
+
+The `typeof window !== 'undefined'` check prevents bundling preloaded modules for SSR, optimizing server bundle size and build speed.
+
+---
+
+## 3. Server-Side Performance
+
+**Impact: HIGH**
+
+Optimizing server-side rendering and data fetching eliminates server-side waterfalls and reduces response times.
+
+### 3.1 Authenticate Server Actions Like API Routes
+
+**Impact: CRITICAL (prevents unauthorized access to server mutations)**
+
+Server Actions (functions with `"use server"`) are exposed as public endpoints, just like API routes. Always verify authentication and authorization **inside** each Server Action—do not rely solely on middleware, layout guards, or page-level checks, as Server Actions can be invoked directly.
+
+Next.js documentation explicitly states: "Treat Server Actions with the same security considerations as public-facing API endpoints, and verify if the user is allowed to perform a mutation."
+
+**Incorrect: no authentication check**
+
+```typescript
+'use server'
+
+export async function deleteUser(userId: string) {
+  // Anyone can call this! No auth check
+  await db.user.delete({ where: { id: userId } })
+  return { success: true }
+}
+```
+
+**Correct: authentication inside the action**
+
+```typescript
+'use server'
+
+import { verifySession } from '@/lib/auth'
+import { unauthorized } from '@/lib/errors'
+
+export async function deleteUser(userId: string) {
+  // Always check auth inside the action
+  const session = await verifySession()
+  
+  if (!session) {
+    throw unauthorized('Must be logged in')
+  }
+  
+  // Check authorization too
+  if (session.user.role !== 'admin' && session.user.id !== userId) {
+    throw unauthorized('Cannot delete other users')
+  }
+  
+  await db.user.delete({ where: { id: userId } })
+  return { success: true }
+}
+```
+
+**With input validation:**
+
+```typescript
+'use server'
+
+import { verifySession } from '@/lib/auth'
+import { z } from 'zod'
+
+const updateProfileSchema = z.object({
+  userId: z.string().uuid(),
+  name: z.string().min(1).max(100),
+  email: z.string().email()
+})
+
+export async function updateProfile(data: unknown) {
+  // Validate input first
+  const validated = updateProfileSchema.parse(data)
+  
+  // Then authenticate
+  const session = await verifySession()
+  if (!session) {
+    throw new Error('Unauthorized')
+  }
+  
+  // Then authorize
+  if (session.user.id !== validated.userId) {
+    throw new Error('Can only update own profile')
+  }
+  
+  // Finally perform the mutation
+  await db.user.update({
+    where: { id: validated.userId },
+    data: {
+      name: validated.name,
+      email: validated.email
+    }
+  })
+  
+  return { success: true }
+}
+```
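One possible way to make the in-action check hard to forget is a small wrapper. This is a sketch under stated assumptions, not a Next.js API: `withSession`, `Session`, `GetSession`, and `makeDeleteUser` are hypothetical names, and the `deleted` array stands in for a database.

```typescript
// Hypothetical session shape and lookup; a real app would use its own auth library.
type Session = { user: { id: string; role: 'admin' | 'member' } }
type GetSession = () => Promise<Session | null>

// Wraps an action so the authentication check runs inside every invocation.
function withSession<Args extends unknown[], Result>(
  getSession: GetSession,
  action: (session: Session, ...args: Args) => Promise<Result>
) {
  return async (...args: Args): Promise<Result> => {
    const session = await getSession()
    if (!session) {
      throw new Error('Unauthorized')
    }
    return action(session, ...args)
  }
}

// Example action: authorization is still decided per action, using the verified session
function makeDeleteUser(getSession: GetSession, deleted: string[]) {
  return withSession(getSession, async (session: Session, userId: string) => {
    if (session.user.role !== 'admin' && session.user.id !== userId) {
      throw new Error('Forbidden')
    }
    deleted.push(userId) // stands in for the database call
    return { success: true }
  })
}
```

The session lookup is injected so the sketch runs without an auth library; in a real Server Action file the wrapper would call the app's own session verifier.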
+
+Reference: [https://nextjs.org/docs/app/guides/authentication](https://nextjs.org/docs/app/guides/authentication)
+
+### 3.2 Avoid Duplicate Serialization in RSC Props
+
+**Impact: LOW (reduces network payload by avoiding duplicate serialization)**
+
+RSC→client serialization deduplicates by object reference, not by value: the same reference is serialized once, while a new reference is serialized again even if its contents are identical. Do transformations (`.toSorted()`, `.filter()`, `.map()`) on the client, not the server.
+
+**Incorrect: duplicates array**
+
+```tsx
+// RSC: sends 6 strings (2 arrays × 3 items)
+<ClientList usernames={usernames} usernamesOrdered={usernames.toSorted()} />
+```
+
+**Correct: sends 3 strings**
+
+```tsx
+// RSC: send once
+<ClientList usernames={usernames} />
+
+// Client: transform there
+'use client'
+const sorted = useMemo(() => [...usernames].sort(), [usernames])
+```
+
+**Nested deduplication behavior:**
+
+```tsx
+// string[] - duplicates everything
+usernames={['a','b']} sorted={usernames.toSorted()} // sends 4 strings
+
+// object[] - duplicates array structure only
+users={[{id:1},{id:2}]} sorted={users.toSorted()} // sends 2 arrays + 2 unique objects (not 4)
+```
+
+Deduplication works recursively. Impact varies by data type:
+
+- `string[]`, `number[]`, `boolean[]`: **HIGH impact** - array + all primitives fully duplicated
+
+- `object[]`: **LOW impact** - array duplicated, but nested objects deduplicated by reference
+
+**Operations breaking deduplication: create new references**
+
+- Arrays: `.toSorted()`, `.filter()`, `.map()`, `.slice()`, `[...arr]`
+
+- Objects: `{...obj}`, `Object.assign()`, `structuredClone()`, `JSON.parse(JSON.stringify())`
+
+**More examples:**
+
+```tsx
+// ❌ Bad
+<C users={users} active={users.filter(u => u.active)} />
+<C product={product} productName={product.name} />
+
+// ✅ Good
+<C users={users} />
+<C product={product} />
+// Do filtering/destructuring in client
+```
+
+**Exception:** Pass derived data when transformation is expensive or client doesn't need original.
+
+### 3.3 Cross-Request LRU Caching
+
+**Impact: HIGH (caches across requests)**
+
+`React.cache()` only works within one request. For data shared across sequential requests (user clicks button A then button B), use an LRU cache.
+
+**Implementation:**
+
+```typescript
+import { LRUCache } from 'lru-cache'
+
+const cache = new LRUCache<string, any>({
+  max: 1000,
+  ttl: 5 * 60 * 1000  // 5 minutes
+})
+
+export async function getUser(id: string) {
+  const cached = cache.get(id)
+  if (cached) return cached
+
+  const user = await db.user.findUnique({ where: { id } })
+  cache.set(id, user)
+  return user
+}
+
+// Request 1: DB query, result cached
+// Request 2: cache hit, no DB query
+```
+
+Use when sequential user actions hit multiple endpoints needing the same data within seconds.
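+
+To make the eviction and expiry behavior concrete, here is a minimal, dependency-free sketch of an LRU cache with a TTL built on `Map` insertion order. `TinyLRU` and its explicit `now` parameter are hypothetical names introduced only for illustration. This is not how the `lru-cache` package is implemented, and production code should keep using the library.
+
+```typescript
+// Illustrative only: a tiny LRU + TTL cache, not the lru-cache package.
+class TinyLRU<V> {
+  private entries = new Map<string, { value: V; expiresAt: number }>()
+  private max: number
+  private ttlMs: number
+
+  constructor(max: number, ttlMs: number) {
+    this.max = max
+    this.ttlMs = ttlMs
+  }
+
+  get(key: string, now: number = Date.now()): V | undefined {
+    const entry = this.entries.get(key)
+    if (!entry) return undefined
+    if (entry.expiresAt <= now) {
+      this.entries.delete(key) // expired: drop it and report a miss
+      return undefined
+    }
+    // Re-insert so this key becomes the most recently used
+    this.entries.delete(key)
+    this.entries.set(key, entry)
+    return entry.value
+  }
+
+  set(key: string, value: V, now: number = Date.now()): void {
+    this.entries.delete(key)
+    this.entries.set(key, { value, expiresAt: now + this.ttlMs })
+    if (this.entries.size > this.max) {
+      // Map iterates in insertion order, so the first key is the least recently used
+      const oldest = this.entries.keys().next().value
+      if (oldest !== undefined) this.entries.delete(oldest)
+    }
+  }
+}
+
+const tiny = new TinyLRU<string>(2, 1000)
+tiny.set('a', 'A', 0)
+tiny.set('b', 'B', 0)
+tiny.get('a', 10)      // touches 'a', so 'b' is now least recently used
+tiny.set('c', 'C', 20) // cache is over capacity, so 'b' is evicted
+```
+
+Passing `now` explicitly is a choice made for this sketch so the expiry logic can be exercised without waiting on a real clock.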
+
+**With Vercel's [Fluid Compute](https://vercel.com/docs/fluid-compute):** LRU caching is especially effective because multiple concurrent requests can share the same function instance and cache. This means the cache persists across requests without needing external storage like Redis.
+
+**In traditional serverless:** Each invocation runs in isolation, so consider Redis for cross-process caching.
+
+Reference: [https://github.com/isaacs/node-lru-cache](https://github.com/isaacs/node-lru-cache)
+
+### 3.4 Hoist Static I/O to Module Level
+
+**Impact: HIGH (avoids repeated file/network I/O per request)**
+
+When loading static assets (fonts, logos, images, config files) in route handlers or server functions, hoist the I/O operation to module level. Module-level code runs once when the module is first imported, not on every request. This eliminates redundant file system reads or network fetches that would otherwise run on every invocation.
+
+**Incorrect: reads font file on every request**
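+
+A minimal sketch of the problem, assuming a hypothetical `loadFont()` helper that stands in for a real read such as `fs.promises.readFile` or `fetch`. The counter exists only to make the repeated I/O visible.
+
+```typescript
+let fontLoads = 0
+
+// Hypothetical stand-in for real file or network I/O
+async function loadFont(): Promise<Uint8Array> {
+  fontLoads++
+  return new Uint8Array([0, 1, 2, 3])
+}
+
+// Hypothetical handler: the I/O sits inside it, so every request repeats it
+async function handleRequest(): Promise<number> {
+  const font = await loadFont()
+  return font.byteLength
+}
+```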
+
+**Correct: loads once at module initialization**
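+
+A minimal sketch of the hoisted version, again assuming a hypothetical `loadFont()` helper in place of real I/O. The promise is created at module level, so every request awaits the same in-flight or already settled result.
+
+```typescript
+let fontLoads = 0
+
+// Hypothetical stand-in for real file or network I/O
+async function loadFont(): Promise<Uint8Array> {
+  fontLoads++
+  return new Uint8Array([0, 1, 2, 3])
+}
+
+// Module level: created once, when the module is first imported
+const fontPromise = loadFont()
+
+// Hypothetical handler: awaits the shared promise instead of starting new I/O
+async function handleRequest(): Promise<number> {
+  const font = await fontPromise
+  return font.byteLength
+}
+```
+
+Storing the promise rather than the resolved value means requests that arrive while the load is still pending share it too.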
+
+**Alternative: synchronous file reads with Node.js fs**
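+
+A sketch of the synchronous variant using `readFileSync` at module level. The temporary placeholder file is created here only so the example runs standalone; the path and handler are assumptions, and a real project would point at a font it ships.
+
+```typescript
+import { mkdtempSync, readFileSync, writeFileSync } from 'node:fs'
+import { tmpdir } from 'node:os'
+import { join } from 'node:path'
+
+// Setup for the sketch only: a placeholder file standing in for a real font
+const fontPath = join(mkdtempSync(join(tmpdir(), 'fonts-')), 'placeholder.ttf')
+writeFileSync(fontPath, new Uint8Array([0, 1, 0, 0]))
+
+// Module level: one blocking read at import time, before any request is served
+const fontData = readFileSync(fontPath)
+
+// Hypothetical handler: reuses the buffer that is already in memory
+function handleRequest(): number {
+  return fontData.byteLength
+}
+```
+
+A blocking read is acceptable here because it happens once during module initialization rather than on the request path.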
+
+**General Node.js example: loading config or templates**
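+
+A sketch of the same idea for a JSON config file and a text template. The file names, the `siteName` field, and the `{{placeholder}}` syntax are hypothetical; the files are written to a temporary directory only so the example is self-contained.
+
+```typescript
+import { mkdtempSync, readFileSync, writeFileSync } from 'node:fs'
+import { tmpdir } from 'node:os'
+import { join } from 'node:path'
+
+// Setup for the sketch only; a real app would ship these files
+const dir = mkdtempSync(join(tmpdir(), 'static-'))
+writeFileSync(join(dir, 'config.json'), JSON.stringify({ siteName: 'Example' }))
+writeFileSync(join(dir, 'welcome.txt'), 'Welcome to {{site}}, {{name}}!')
+
+// Module level: both files are read and parsed once
+const config: { siteName: string } = JSON.parse(
+  readFileSync(join(dir, 'config.json'), 'utf8')
+)
+const welcomeTemplate = readFileSync(join(dir, 'welcome.txt'), 'utf8')
+
+// Per-request work is plain string manipulation on in-memory data
+function renderWelcome(name: string): string {
+  return welcomeTemplate
+    .replace('{{site}}', config.siteName)
+    .replace('{{name}}', name)
+}
+```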
+
+**When to use this pattern:**
+
+- Loading fonts for OG image generation
+
+- Loading static logos, icons, or watermarks
+
+- Reading configuration files that don't change at runtime
+
+- Loading email templates or other static templates
+
+- Any static asset that's the same across all requests
+
+**When NOT to use this pattern:**
+
+- Assets that vary per request or user
+
+- Files that may change during runtime (use caching with TTL instead)
+
+- Large files that would consume too much memory if kept loaded
+
+- Sensitive data that shouldn't persist in memory
+
+**With Vercel's [Fluid Compute](https://vercel.com/docs/fluid-compute):** Module-level caching is especially effective because multiple concurrent requests share the same function instance. The static assets stay loaded in memory across requests without cold start penalties.
+
+**In traditional serverless:** Each cold start re-executes module-level code, but subsequent warm invocations reuse the loaded assets until the instance is recycled.
+
+### 3.5 Minimize Serialization at RSC Boundaries
+
+**Impact: HIGH (reduces data transfer size)**
+
+The React Server/Client boundary serializes all object properties into strings and embeds them in the HTML response and subsequent RSC requests. This serialized data directly impacts page weight and load time, so **size matters a lot**. Only pass fields that the client actually uses.
+
+**Incorrect: serializes all 50 fields**
+
+```tsx
+async function Page() {
+  const user = await fetchUser()  // 50 fields
+  return <Profile user={user} />
+}
+
+'use client'
+function Profile({ user }: { user: User }) {
+  return <div>{user.name}</div>  // uses 1 field
+}
+```
+
+**Correct: serializes only 1 field**
+
+```tsx
+async function Page() {
+  const user = await fetchUser()
+  return <Profile name={user.name} />
+}
+
+'use client'
+function Profile({ name }: { name: string }) {
+  return <div>{name}</div>
+}
+```
+
+### 3.6 Parallel Data Fetching with Component Composition
+
+**Impact: CRITICAL (eliminates server-side waterfalls)**
+
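+An async Server Component that awaits data blocks its children from starting to render, so fetches in nested components run one after another. Restructure with composition so that independent fetches live in sibling components and run in parallel.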
+
+**Incorrect: Sidebar waits for Page's fetch to complete**
+
+```tsx
+export default async function Page() {
+  const header = await fetchHeader()
+  return (
+    <div>
+      <div>{header}</div>
+      <Sidebar />
+    </div>
+  )
+}
+
+async function Sidebar() {
+  const items = await fetchSidebarItems()
+  return <nav>{items.map(renderItem)}</nav>
+}
+```
+
+**Correct: both fetch simultaneously**
+
+```tsx
+async function Header() {
+  const data = await fetchHeader()
+  return <div>{data}</div>
+}
+
+async function Sidebar() {
+  const items = await fetchSidebarItems()
+  return <nav>{items.map(renderItem)}</nav>
+}
+
+export default function Page() {
+  return (
+    <div>
+      <Header />
+      <Sidebar />
+    </div>
+  )
+}
+```
+
+**Alternative with children prop:**
+
+```tsx
+async function Header() {
+  const data = await fetchHeader()
+  return <div>{data}</div>
+}
+
+async function Sidebar() {
+  const items = await fetchSidebarItems()
+  return <nav>{items.map(renderItem)}</nav>
+}
+
+function Layout({ children }: { children: ReactNode }) {
+  return (
+    <div>
+      <Header />
+      {children}
+    </div>
+  )
+}
+
+export default function Page() {
+  return (
+    <Layout>
+      <Sidebar />
+    </Layout>
+  )
+}
+```
+
+### 3.7 Per-Request Deduplication with React.cache()
+
+**Impact: MEDIUM (deduplicates within request)**
+
+Use `React.cache()` for server-side request deduplication. Authentication and database queries benefit most.
+
+**Usage:**
+
+```typescript
+import { cache } from 'react'
+
+export const getCurrentUser = cache(async () => {
+  const session = await auth()
+  if (!session?.user?.id) return null
+  return await db.user.findUnique({
+    where: { id: session.user.id }
+  })
+})
+```
+
+Within a single request, multiple calls to `getCurrentUser()` execute the query only once.
+
+**Avoid inline objects as arguments:**
+
+`React.cache()` uses shallow equality (`Object.is`) to determine cache hits. Inline objects create new references each call, preventing cache hits.
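+
+The lookup rule can be illustrated without React. The sketch below is a toy single-argument memoizer that compares arguments with `Object.is`; `memoizeByIdentity` is a hypothetical helper, not React's implementation, and it ignores per-request scoping entirely.
+
+```typescript
+// Toy memoizer: a hit requires Object.is(previousArg, arg) to be true
+function memoizeByIdentity<A, R>(fn: (arg: A) => R): (arg: A) => R {
+  const seen: Array<{ arg: A; result: R }> = []
+  return (arg: A) => {
+    const hit = seen.find(entry => Object.is(entry.arg, arg))
+    if (hit) return hit.result
+    const result = fn(arg)
+    seen.push({ arg, result })
+    return result
+  }
+}
+
+let primitiveCalls = 0
+const byId = memoizeByIdentity((id: number) => {
+  primitiveCalls++
+  return 'user-' + id
+})
+byId(1)
+byId(1) // primitives compare by value, so this is a hit
+
+let objectCalls = 0
+const byParams = memoizeByIdentity((params: { uid: number }) => {
+  objectCalls++
+  return 'user-' + params.uid
+})
+byParams({ uid: 1 })
+byParams({ uid: 1 }) // a new object each call, so this is a miss
+```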
+
+**Incorrect: always cache miss**
+
+```typescript
+const getUser = cache(async (params: { uid: number }) => {
+  return await db.user.findUnique({ where: { id: params.uid } })
+})
+
+// Each call creates new object, never hits cache
+getUser({ uid: 1 })
+getUser({ uid: 1 })  // Cache miss, runs query again
+```
+
+**Correct: cache hit**
+
+If you must pass objects, pass the same reference:
+
+```typescript
+const params = { uid: 1 }
+getUser(params)  // Query runs
+getUser(params)  // Cache hit (same reference)
+```
+
+**Next.js-Specific Note:**
+
+In Next.js, the `fetch` API is automatically extended with request memoization. Requests with the same URL and options are automatically deduplicated within a single request, so you don't need `React.cache()` for `fetch` calls. However, `React.cache()` is still essential for other async tasks:
+
+- Database queries (Prisma, Drizzle, etc.)
+
+- Heavy computations
+
+- Authentication checks
+
+- File system operations
+
+- Any non-fetch async work
+
+Use `React.cache()` to deduplicate these operations across your component tree.
+
+Reference: [https://react.dev/reference/react/cache](https://react.dev/reference/react/cache)
+
+### 3.8 Use after() for Non-Blocking Operations
+
+**Impact: MEDIUM (faster response times)**
+
+Use Next.js's `after()` to schedule work that should execute after a response is sent. This prevents logging, analytics, and other side effects from blocking the response.
+
+**Incorrect: blocks response**
+
+```tsx
+import { logUserAction } from '@/app/utils'
+
+export async function POST(request: Request) {
+  // Perform mutation
+  await updateDatabase(request)
+  
+  // Logging blocks the response
+  const userAgent = request.headers.get('user-agent') || 'unknown'
+  await logUserAction({ userAgent })
+  
+  return new Response(JSON.stringify({ status: 'success' }), {
+    status: 200,
+    headers: { 'Content-Type': 'application/json' }
+  })
+}
+```
+
+**Correct: non-blocking**
+
+```tsx
+import { after } from 'next/server'
+import { headers, cookies } from 'next/headers'
+import { logUserAction } from '@/app/utils'
+
+export async function POST(request: Request) {
+  // Perform mutation
+  await updateDatabase(request)
+  
+  // Log after response is sent
+  after(async () => {
+    const userAgent = (await headers()).get('user-agent') || 'unknown'
+    const sessionCookie = (await cookies()).get('session-id')?.value || 'anonymous'
+    
+    await logUserAction({ sessionCookie, userAgent })
+  })
+  
+  return new Response(JSON.stringify({ status: 'success' }), {
+    status: 200,
+    headers: { 'Content-Type': 'application/json' }
+  })
+}
+```
+
+The response is sent immediately while logging happens in the background.
+
+**Common use cases:**
+
+- Analytics tracking
+
+- Audit logging
+
+- Sending notifications
+
+- Cache invalidation
+
+- Cleanup tasks
+
+**Important notes:**
+
+- `after()` runs even if the response fails or redirects
+
+- Works in Server Actions, Route Handlers, and Server Components
+
+Reference: [https://nextjs.org/docs/app/api-reference/functions/after](https://nextjs.org/docs/app/api-reference/functions/after)
+
+---
+
+## 4. Client-Side Data Fetching
+
+**Impact: MEDIUM-HIGH**
+
+Automatic deduplication and efficient data fetching patterns reduce redundant network requests.
+
+### 4.1 Deduplicate Global Event Listeners
+
+**Impact: LOW (single listener for N components)**
+
+Use `useSWRSubscription()` to share global event listeners across component instances.
+
+**Incorrect: N instances = N listeners**
+
+```tsx
+function useKeyboardShortcut(key: string, callback: () => void) {
+  useEffect(() => {
+    const handler = (e: KeyboardEvent) => {
+      if (e.metaKey && e.key === key) {
+        callback()
+      }
+    }
+    window.addEventListener('keydown', handler)
+    return () => window.removeEventListener('keydown', handler)
+  }, [key, callback])
+}
+```
+
+When using the `useKeyboardShortcut` hook multiple times, each instance will register a new listener.
+
+**Correct: N instances = 1 listener**
+
+```tsx
+import useSWRSubscription from 'swr/subscription'
+
+// Module-level Map to track callbacks per key
+const keyCallbacks = new Map<string, Set<() => void>>()
+
+function useKeyboardShortcut(key: string, callback: () => void) {
+  // Register this callback in the Map
+  useEffect(() => {
+    if (!keyCallbacks.has(key)) {
+      keyCallbacks.set(key, new Set())
+    }
+    keyCallbacks.get(key)!.add(callback)
+
+    return () => {
+      const set = keyCallbacks.get(key)
+      if (set) {
+        set.delete(callback)
+        if (set.size === 0) {
+          keyCallbacks.delete(key)
+        }
+      }
+    }
+  }, [key, callback])
+
+  useSWRSubscription('global-keydown', () => {
+    const handler = (e: KeyboardEvent) => {
+      if (e.metaKey && keyCallbacks.has(e.key)) {
+        keyCallbacks.get(e.key)!.forEach(cb => cb())
+      }
+    }
+    window.addEventListener('keydown', handler)
+    return () => window.removeEventListener('keydown', handler)
+  })
+}
+
+function Profile() {
+  // Multiple shortcuts will share the same listener
+  useKeyboardShortcut('p', () => { /* ... */ }) 
+  useKeyboardShortcut('k', () => { /* ... */ })
+  // ...
+}
+```
+
+### 4.2 Use Passive Event Listeners for Scrolling Performance
+
+**Impact: MEDIUM (eliminates scroll delay caused by event listeners)**
+
+Add `{ passive: true }` to touch and wheel event listeners to enable immediate scrolling. Browsers normally wait for listeners to finish to check if `preventDefault()` is called, causing scroll delay.
+
+**Incorrect:**
+
+```typescript
+useEffect(() => {
+  const handleTouch = (e: TouchEvent) => console.log(e.touches[0].clientX)
+  const handleWheel = (e: WheelEvent) => console.log(e.deltaY)
+  
+  document.addEventListener('touchstart', handleTouch)
+  document.addEventListener('wheel', handleWheel)
+  
+  return () => {
+    document.removeEventListener('touchstart', handleTouch)
+    document.removeEventListener('wheel', handleWheel)
+  }
+}, [])
+```
+
+**Correct:**
+
+```typescript
+useEffect(() => {
+  const handleTouch = (e: TouchEvent) => console.log(e.touches[0].clientX)
+  const handleWheel = (e: WheelEvent) => console.log(e.deltaY)
+  
+  document.addEventListener('touchstart', handleTouch, { passive: true })
+  document.addEventListener('wheel', handleWheel, { passive: true })
+  
+  return () => {
+    document.removeEventListener('touchstart', handleTouch)
+    document.removeEventListener('wheel', handleWheel)
+  }
+}, [])
+```
+
+**Use passive when:** tracking/analytics, logging, any listener that doesn't call `preventDefault()`.
+
+**Don't use passive when:** implementing custom swipe gestures, custom zoom controls, or any listener that needs `preventDefault()`.
+
+### 4.3 Use SWR for Automatic Deduplication
+
+**Impact: MEDIUM-HIGH (automatic deduplication)**
+
+SWR enables request deduplication, caching, and revalidation across component instances.
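+
+As a rough mental model of request deduplication, the sketch below shares one in-flight promise per key. `dedupedFetch` and `inFlight` are hypothetical names; this is a simplified illustration, not SWR's implementation, and it omits caching and revalidation.
+
+```typescript
+// Toy in-flight deduplication: callers asking for the same key share one promise
+const inFlight = new Map<string, Promise<unknown>>()
+
+function dedupedFetch<T>(
+  key: string,
+  fetcher: (key: string) => Promise<T>
+): Promise<T> {
+  const existing = inFlight.get(key)
+  if (existing) return existing as Promise<T>
+
+  const request = fetcher(key)
+  inFlight.set(key, request)
+
+  // Forget the request once it settles, whether it succeeded or failed
+  const clear = () => {
+    inFlight.delete(key)
+  }
+  request.then(clear, clear)
+
+  return request
+}
+
+let fetchCount = 0
+const countingFetcher = async (key: string) => {
+  fetchCount++
+  return key.length
+}
+
+const first = dedupedFetch('/api/users', countingFetcher)
+const second = dedupedFetch('/api/users', countingFetcher) // reuses the pending request
+```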
+
+**Incorrect: no deduplication, each instance fetches**
+
+```tsx
+function UserList() {
+  const [users, setUsers] = useState([])
+  useEffect(() => {
+    fetch('/api/users')
+      .then(r => r.json())
+      .then(setUsers)
+  }, [])
+}
+```
+
+**Correct: multiple instances share one request**
+
+```tsx
+import useSWR from 'swr'
+
+function UserList() {
+  const { data: users } = useSWR('/api/users', fetcher)
+}
+```
+
+**For immutable data:**
+
+```tsx
+import useSWRImmutable from 'swr/immutable'
+
+function StaticContent() {
+  const { data } = useSWRImmutable('/api/config', fetcher)
+}
+```
+
+**For mutations:**
+
+```tsx
+import useSWRMutation from 'swr/mutation'
+
+function UpdateButton() {
+  const { trigger } = useSWRMutation('/api/user', updateUser)
+  return <button onClick={() => trigger()}>Update</button>
+}
+```
+
+Reference: [https://swr.vercel.app](https://swr.vercel.app)
+
+### 4.4 Version and Minimize localStorage Data
+
+**Impact: MEDIUM (prevents schema conflicts, reduces storage size)**
+
+Add a version to each key (for example `userConfig:v2`) and store only the fields you need. This prevents schema conflicts and accidental storage of sensitive data.
+
+**Incorrect:**
+
+```typescript
+// No version, stores everything, no error handling
+localStorage.setItem('userConfig', JSON.stringify(fullUserObject))
+const data = localStorage.getItem('userConfig')
+```
+
+**Correct:**
+
+```typescript
+const VERSION = 'v2'
+
+function saveConfig(config: { theme: string; language: string }) {
+  try {
+    localStorage.setItem(`userConfig:${VERSION}`, JSON.stringify(config))
+  } catch {
+    // Throws in incognito/private browsing, quota exceeded, or disabled
+  }
+}
+
+function loadConfig() {
+  try {
+    const data = localStorage.getItem(`userConfig:${VERSION}`)
+    return data ? JSON.parse(data) : null
+  } catch {
+    return null
+  }
+}
+
+// Migration from v1 to v2
+function migrate() {
+  try {
+    const v1 = localStorage.getItem('userConfig:v1')
+    if (v1) {
+      const old = JSON.parse(v1)
+      saveConfig({ theme: old.darkMode ? 'dark' : 'light', language: old.lang })
+      localStorage.removeItem('userConfig:v1')
+    }
+  } catch {}
+}
+```
+
+**Store minimal fields from server responses:**
+
+```typescript
+// User object has 20+ fields, only store what UI needs
+function cachePrefs(user: FullUser) {
+  try {
+    localStorage.setItem('prefs:v1', JSON.stringify({
+      theme: user.preferences.theme,
+      notifications: user.preferences.notifications
+    }))
+  } catch {}
+}
+```
+
+**Always wrap in try-catch:** `getItem()` and `setItem()` throw in incognito/private browsing (Safari, Firefox), when quota exceeded, or when disabled.
+
+**Benefits:** Schema evolution via versioning, reduced storage size, prevents storing tokens/PII/internal flags.
+
+---
+
+## 5. Re-render Optimization
+
+**Impact: MEDIUM**
+
+Reducing unnecessary re-renders minimizes wasted computation and improves UI responsiveness.
+
+### 5.1 Calculate Derived State During Rendering
+
+**Impact: MEDIUM (avoids redundant renders and state drift)**
+
+If a value can be computed from current props/state, do not store it in state or update it in an effect. Derive it during render to avoid extra renders and state drift. Do not set state in effects solely in response to prop changes; prefer derived values or keyed resets instead.
+
+**Incorrect: redundant state and effect**
+
+```tsx
+function Form() {
+  const [firstName, setFirstName] = useState('First')
+  const [lastName, setLastName] = useState('Last')
+  const [fullName, setFullName] = useState('')
+
+  useEffect(() => {
+    setFullName(firstName + ' ' + lastName)
+  }, [firstName, lastName])
+
+  return <p>{fullName}</p>
+}
+```
+
+**Correct: derive during render**
+
+```tsx
+function Form() {
+  const [firstName, setFirstName] = useState('First')
+  const [lastName, setLastName] = useState('Last')
+  const fullName = firstName + ' ' + lastName
+
+  return <p>{fullName}</p>
+}
+```
+
+Reference: [https://react.dev/learn/you-might-not-need-an-effect](https://react.dev/learn/you-might-not-need-an-effect)
+
+### 5.2 Defer State Reads to Usage Point
+
+**Impact: MEDIUM (avoids unnecessary subscriptions)**
+
+Don't subscribe to dynamic state (searchParams, localStorage) if you only read it inside callbacks.
+
+**Incorrect: subscribes to all searchParams changes**
+
+```tsx
+function ShareButton({ chatId }: { chatId: string }) {
+  const searchParams = useSearchParams()
+
+  const handleShare = () => {
+    const ref = searchParams.get('ref')
+    shareChat(chatId, { ref })
+  }
+
+  return <button onClick={handleShare}>Share</button>
+}
+```
+
+**Correct: reads on demand, no subscription**
+
+```tsx
+function ShareButton({ chatId }: { chatId: string }) {
+  const handleShare = () => {
+    const params = new URLSearchParams(window.location.search)
+    const ref = params.get('ref')
+    shareChat(chatId, { ref })
+  }
+
+  return <button onClick={handleShare}>Share</button>
+}
+```
+
+### 5.3 Do not wrap a simple expression with a primitive result type in useMemo
+
+**Impact: LOW-MEDIUM (wasted computation on every render)**
+
+When an expression is simple (few logical or arithmetical operators) and has a primitive result type (boolean, number, string), do not wrap it in `useMemo`.
+
+Calling `useMemo` and comparing hook dependencies may consume more resources than the expression itself.
+
+**Incorrect:**
+
+```tsx
+function Header({ user, notifications }: Props) {
+  const isLoading = useMemo(() => {
+    return user.isLoading || notifications.isLoading
+  }, [user.isLoading, notifications.isLoading])
+
+  if (isLoading) return <Skeleton />
+  // return some markup
+}
+```
+
+**Correct:**
+
+```tsx
+function Header({ user, notifications }: Props) {
+  const isLoading = user.isLoading || notifications.isLoading
+
+  if (isLoading) return <Skeleton />
+  // return some markup
+}
+```
+
+### 5.4 Don't Define Components Inside Components
+
+**Impact: HIGH (prevents remount on every render)**
+
+Defining a component inside another component creates a new component type on every render. React sees a different component each time and fully remounts it, destroying all state and DOM.
+
+A common reason developers do this is to access parent variables without passing props. Always pass props instead.
+
+**Incorrect: remounts on every render**
+
+```tsx
+function UserProfile({ user, theme }) {
+  // Defined inside to access `theme` - BAD
+  const Avatar = () => (
+    <img
+      src={user.avatarUrl}
+      className={theme === 'dark' ? 'avatar-dark' : 'avatar-light'}
+    />
+  )
+
+  // Defined inside to access `user` - BAD
+  const Stats = () => (
+    <div>
+      <span>{user.followers} followers</span>
+      <span>{user.posts} posts</span>
+    </div>
+  )
+
+  return (
+    <div>
+      <Avatar />
+      <Stats />
+    </div>
+  )
+}
+```
+
+Every time `UserProfile` renders, `Avatar` and `Stats` are new component types. React unmounts the old instances and mounts new ones, losing any internal state, running effects again, and recreating DOM nodes.
+
+**Correct: pass props instead**
+
+```tsx
+function Avatar({ src, theme }: { src: string; theme: string }) {
+  return (
+    <img
+      src={src}
+      className={theme === 'dark' ? 'avatar-dark' : 'avatar-light'}
+    />
+  )
+}
+
+function Stats({ followers, posts }: { followers: number; posts: number }) {
+  return (
+    <div>
+      <span>{followers} followers</span>
+      <span>{posts} posts</span>
+    </div>
+  )
+}
+
+function UserProfile({ user, theme }) {
+  return (
+    <div>
+      <Avatar src={user.avatarUrl} theme={theme} />
+      <Stats followers={user.followers} posts={user.posts} />
+    </div>
+  )
+}
+```
+
+**Symptoms of this bug:**
+
+- Input fields lose focus on every keystroke
+
+- Animations restart unexpectedly
+
+- `useEffect` cleanup/setup runs on every parent render
+
+- Scroll position resets inside the component
+
+### 5.5 Extract Default Non-primitive Parameter Value from Memoized Component to Constant
+
+**Impact: MEDIUM (restores memoization by using a constant for default value)**
+
+When a memoized component has a default value for a non-primitive optional parameter (an array, function, or object), calling the component without that parameter breaks memoization. A new value instance is created on every re-render, and it does not pass the strict equality comparison in `memo()`.
+
+To address this issue, extract the default value into a constant.
+
+**Incorrect: `onClick` has different values on every rerender**
+
+```tsx
+const UserAvatar = memo(function UserAvatar({ onClick = () => {} }: { onClick?: () => void }) {
+  // ...
+})
+
+// Used without optional onClick
+<UserAvatar />
+```
+
+**Correct: stable default value**
+
+```tsx
+const NOOP = () => {};
+
+const UserAvatar = memo(function UserAvatar({ onClick = NOOP }: { onClick?: () => void }) {
+  // ...
+})
+
+// Used without optional onClick
+<UserAvatar />
+```
+
+### 5.6 Extract to Memoized Components
+
+**Impact: MEDIUM (enables early returns)**
+
+Extract expensive work into memoized components to enable early returns before computation.
+
+**Incorrect: computes avatar even when loading**
+
+```tsx
+function Profile({ user, loading }: Props) {
+  const avatar = useMemo(() => {
+    const id = computeAvatarId(user)
+    return <Avatar id={id} />
+  }, [user])
+
+  if (loading) return <Skeleton />
+  return <div>{avatar}</div>
+}
+```
+
+**Correct: skips computation when loading**
+
+```tsx
+const UserAvatar = memo(function UserAvatar({ user }: { user: User }) {
+  const id = useMemo(() => computeAvatarId(user), [user])
+  return <Avatar id={id} />
+})
+
+function Profile({ user, loading }: Props) {
+  if (loading) return <Skeleton />
+  return (
+    <div>
+      <UserAvatar user={user} />
+    </div>
+  )
+}
+```
+
+**Note:** If your project has [React Compiler](https://react.dev/learn/react-compiler) enabled, manual memoization with `memo()` and `useMemo()` is not necessary. The compiler automatically optimizes re-renders.
+
+### 5.7 Narrow Effect Dependencies
+
+**Impact: LOW (minimizes effect re-runs)**
+
+Specify primitive dependencies instead of objects to minimize effect re-runs.
+
+**Incorrect: re-runs on any user field change**
+
+```tsx
+useEffect(() => {
+  console.log(user.id)
+}, [user])
+```
+
+**Correct: re-runs only when id changes**
+
+```tsx
+useEffect(() => {
+  console.log(user.id)
+}, [user.id])
+```
+
+**For derived state, compute outside effect:**
+
+```tsx
+// Incorrect: runs on width=767, 766, 765...
+useEffect(() => {
+  if (width < 768) {
+    enableMobileMode()
+  }
+}, [width])
+
+// Correct: runs only on boolean transition
+const isMobile = width < 768
+useEffect(() => {
+  if (isMobile) {
+    enableMobileMode()
+  }
+}, [isMobile])
+```
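+
+The difference in re-run counts can be checked with a small framework-free model. `createEffectRunner` is a hypothetical helper that re-runs only when a dependency changes under `Object.is`, mirroring the rule above; it is a sketch, not React's scheduler.
+
+```typescript
+// Toy effect runner: counts a run whenever any dependency differs from the last render
+function createEffectRunner() {
+  let prevDeps: unknown[] | undefined
+  let runs = 0
+  return {
+    render(deps: unknown[]) {
+      const previous = prevDeps
+      const changed =
+        previous === undefined ||
+        deps.length !== previous.length ||
+        deps.some((dep, i) => !Object.is(dep, previous[i]))
+      if (changed) runs++
+      prevDeps = deps
+    },
+    runs: () => runs
+  }
+}
+
+const widths = [767, 766, 765, 800]
+const byWidth = createEffectRunner()
+const byFlag = createEffectRunner()
+
+for (const width of widths) {
+  byWidth.render([width])       // dependency is the raw width
+  byFlag.render([width < 768])  // dependency is the derived boolean
+}
+```
+
+With these four widths, the width-keyed effect runs on every render, while the boolean-keyed effect runs only on the first render and again when the value crosses the 768 threshold.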
+
+### 5.8 Put Interaction Logic in Event Handlers
+
+**Impact: MEDIUM (avoids effect re-runs and duplicate side effects)**
+
+If a side effect is triggered by a specific user action (submit, click, drag), run it in that event handler. Do not model the action as state + effect; it makes effects re-run on unrelated changes and can duplicate the action.
+
+**Incorrect: event modeled as state + effect**
+
+```tsx
+function Form() {
+  const [submitted, setSubmitted] = useState(false)
+  const theme = useContext(ThemeContext)
+
+  useEffect(() => {
+    if (submitted) {
+      post('/api/register')
+      showToast('Registered', theme)
+    }
+  }, [submitted, theme])
+
+  return <button onClick={() => setSubmitted(true)}>Submit</button>
+}
+```
+
+**Correct: do it in the handler**
+
+```tsx
+function Form() {
+  const theme = useContext(ThemeContext)
+
+  function handleSubmit() {
+    post('/api/register')
+    showToast('Registered', theme)
+  }
+
+  return <button onClick={handleSubmit}>Submit</button>
+}
+```
+
+Reference: [https://react.dev/learn/removing-effect-dependencies#should-this-code-move-to-an-event-handler](https://react.dev/learn/removing-effect-dependencies#should-this-code-move-to-an-event-handler)
+
+### 5.9 Subscribe to Derived State
+
+**Impact: MEDIUM (reduces re-render frequency)**
+
+Subscribe to derived boolean state instead of continuous values to reduce re-render frequency.
+
+**Incorrect: re-renders on every pixel change**
+
+```tsx
+function Sidebar() {
+  const width = useWindowWidth()  // updates continuously
+  const isMobile = width < 768
+  return <nav className={isMobile ? 'mobile' : 'desktop'} />
+}
+```
+
+**Correct: re-renders only when boolean changes**
+
+```tsx
+function Sidebar() {
+  const isMobile = useMediaQuery('(max-width: 767px)')
+  return <nav className={isMobile ? 'mobile' : 'desktop'} />
+}
+```
+
+### 5.10 Use Functional setState Updates
+
+**Impact: MEDIUM (prevents stale closures and unnecessary callback recreations)**
+
+When updating state based on the current state value, use the functional update form of setState instead of directly referencing the state variable. This prevents stale closures, eliminates unnecessary dependencies, and creates stable callback references.
+
+**Incorrect: requires state as dependency**
+
+```tsx
+function TodoList() {
+  const [items, setItems] = useState(initialItems)
+  
+  // Callback must depend on items, recreated on every items change
+  const addItems = useCallback((newItems: Item[]) => {
+    setItems([...items, ...newItems])
+  }, [items])  // ❌ items dependency causes recreations
+  
+  // Risk of stale closure if dependency is forgotten
+  const removeItem = useCallback((id: string) => {
+    setItems(items.filter(item => item.id !== id))
+  }, [])  // ❌ Missing items dependency - will use stale items!
+  
+  return <ItemsEditor items={items} onAdd={addItems} onRemove={removeItem} />
+}
+```
+
+The first callback is recreated every time `items` changes, which can cause child components to re-render unnecessarily. The second callback has a stale closure bug—it will always reference the initial `items` value.
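+
+The stale closure can be reproduced outside React. In the sketch below, `createState` is a hypothetical state cell whose setter accepts either a value or an updater function; it only models the semantics and is not React's `useState`.
+
+```typescript
+// Framework-free state cell: set() takes a value or an updater of the previous value
+function createState<T>(initial: T) {
+  let current = initial
+  return {
+    get: () => current,
+    set: (next: T | ((prev: T) => T)) => {
+      current =
+        typeof next === 'function' ? (next as (prev: T) => T)(current) : next
+    }
+  }
+}
+
+const counter = createState(0)
+
+// Captured once, like a callback whose dependency array omits the state
+const snapshot = counter.get()
+const staleIncrement = () => counter.set(snapshot + 1)
+
+// Reads the latest value at call time
+const functionalIncrement = () => counter.set(prev => prev + 1)
+
+staleIncrement()
+staleIncrement()
+const afterStale = counter.get() // both calls wrote snapshot + 1
+
+counter.set(0)
+functionalIncrement()
+functionalIncrement()
+const afterFunctional = counter.get() // each call built on the previous result
+```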
+
+**Correct: stable callbacks, no stale closures**
+
+```tsx
+function TodoList() {
+  const [items, setItems] = useState(initialItems)
+  
+  // Stable callback, never recreated
+  const addItems = useCallback((newItems: Item[]) => {
+    setItems(curr => [...curr, ...newItems])
+  }, [])  // ✅ No dependencies needed
+  
+  // Always uses latest state, no stale closure risk
+  const removeItem = useCallback((id: string) => {
+    setItems(curr => curr.filter(item => item.id !== id))
+  }, [])  // ✅ Safe and stable
+  
+  return <ItemsEditor items={items} onAdd={addItems} onRemove={removeItem} />
+}
+```
+
+**Benefits:**
+
+1. **Stable callback references** - Callbacks don't need to be recreated when state changes
+
+2. **No stale closures** - Always operates on the latest state value
+
+3. **Fewer dependencies** - Simplifies dependency arrays and reduces the chance of omitting one
+
+4. **Prevents bugs** - Eliminates the most common source of React closure bugs
+
+**When to use functional updates:**
+
+- Any setState that depends on the current state value
+
+- Inside useCallback/useMemo when state is needed
+
+- Event handlers that reference state
+
+- Async operations that update state
+
+**When direct updates are fine:**
+
+- Setting state to a static value: `setCount(0)`
+
+- Setting state from props/arguments only: `setName(newName)`
+
+- State doesn't depend on previous value
+
+**Note:** If your project has [React Compiler](https://react.dev/learn/react-compiler) enabled, the compiler can automatically optimize some cases, but functional updates are still recommended for correctness and to prevent stale closure bugs.
+
+### 5.11 Use Lazy State Initialization
+
+**Impact: MEDIUM (wasted computation on every render)**
+
+Pass a function to `useState` for expensive initial values. Without the function form, the initializer runs on every render even though the value is only used once.
+
+**Incorrect: runs on every render**
+
+```tsx
+function FilteredList({ items }: { items: Item[] }) {
+  // buildSearchIndex() runs on EVERY render, even after initialization
+  const [searchIndex, setSearchIndex] = useState(buildSearchIndex(items))
+  const [query, setQuery] = useState('')
+  
+  // When query changes, buildSearchIndex runs again unnecessarily
+  return <SearchResults index={searchIndex} query={query} />
+}
+
+function UserProfile() {
+  // JSON.parse runs on every render
+  const [settings, setSettings] = useState(
+    JSON.parse(localStorage.getItem('settings') || '{}')
+  )
+  
+  return <SettingsForm settings={settings} onChange={setSettings} />
+}
+```
+
+**Correct: runs only once**
+
+```tsx
+function FilteredList({ items }: { items: Item[] }) {
+  // buildSearchIndex() runs ONLY on initial render
+  const [searchIndex, setSearchIndex] = useState(() => buildSearchIndex(items))
+  const [query, setQuery] = useState('')
+  
+  return <SearchResults index={searchIndex} query={query} />
+}
+
+function UserProfile() {
+  // JSON.parse runs only on initial render
+  const [settings, setSettings] = useState(() => {
+    const stored = localStorage.getItem('settings')
+    return stored ? JSON.parse(stored) : {}
+  })
+  
+  return <SettingsForm settings={settings} onChange={setSettings} />
+}
+```
+
+Use lazy initialization when computing initial values from localStorage/sessionStorage, building data structures (indexes, maps), reading from the DOM, or performing heavy transformations.
+
+For simple primitives (`useState(0)`), direct references (`useState(props.value)`), or cheap literals (`useState({})`), the function form is unnecessary.
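+
+One caveat with the `localStorage` example: `JSON.parse` throws on malformed input, and a throwing initializer would break the first render. Below is a minimal sketch of a defensive parser. The `Settings` shape and the helper name `parseStoredSettings` are assumptions for illustration, not part of the examples above. It could be called from the lazy initializer as `useState(() => parseStoredSettings(localStorage.getItem('settings')))`.
+
+```typescript
+// Hypothetical settings shape, for illustration only.
+type Settings = Record<string, unknown>
+
+// Parses a raw storage value, falling back to an empty object when the
+// value is missing, malformed, or not a plain object.
+function parseStoredSettings(raw: string | null): Settings {
+  if (raw === null) return {}
+  try {
+    const parsed: unknown = JSON.parse(raw)
+    const isPlainObject =
+      typeof parsed === 'object' && parsed !== null && !Array.isArray(parsed)
+    return isPlainObject ? (parsed as Settings) : {}
+  } catch {
+    return {}
+  }
+}
+```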
+
+### 5.12 Use Transitions for Non-Urgent Updates
+
+**Impact: MEDIUM (maintains UI responsiveness)**
+
+Mark frequent, non-urgent state updates as transitions to maintain UI responsiveness.
+
+**Incorrect: blocks UI on every scroll**
+
+```tsx
+function ScrollTracker() {
+  const [scrollY, setScrollY] = useState(0)
+  useEffect(() => {
+    const handler = () => setScrollY(window.scrollY)
+    window.addEventListener('scroll', handler, { passive: true })
+    return () => window.removeEventListener('scroll', handler)
+  }, [])
+}
+```
+
+**Correct: non-blocking updates**
+
+```tsx
+import { startTransition } from 'react'
+
+function ScrollTracker() {
+  const [scrollY, setScrollY] = useState(0)
+  useEffect(() => {
+    const handler = () => {
+      startTransition(() => setScrollY(window.scrollY))
+    }
+    window.addEventListener('scroll', handler, { passive: true })
+    return () => window.removeEventListener('scroll', handler)
+  }, [])
+}
+```
+
+### 5.13 Use useRef for Transient Values
+
+**Impact: MEDIUM (avoids unnecessary re-renders on frequent updates)**
+
+When a value changes frequently and you don't want a re-render on every update (e.g., mouse trackers, intervals, transient flags), store it in `useRef` instead of `useState`. Keep component state for UI; use refs for temporary DOM-adjacent values. Updating a ref does not trigger a re-render.
+
+**Incorrect: renders every update**
+
+```tsx
+function Tracker() {
+  const [lastX, setLastX] = useState(0)
+
+  useEffect(() => {
+    const onMove = (e: MouseEvent) => setLastX(e.clientX)
+    window.addEventListener('mousemove', onMove)
+    return () => window.removeEventListener('mousemove', onMove)
+  }, [])
+
+  return (
+    <div
+      style={{
+        position: 'fixed',
+        top: 0,
+        left: lastX,
+        width: 8,
+        height: 8,
+        background: 'black',
+      }}
+    />
+  )
+}
+```
+
+**Correct: no re-render for tracking**
+
+```tsx
+function Tracker() {
+  const lastXRef = useRef(0)
+  const dotRef = useRef<HTMLDivElement>(null)
+
+  useEffect(() => {
+    const onMove = (e: MouseEvent) => {
+      lastXRef.current = e.clientX
+      const node = dotRef.current
+      if (node) {
+        node.style.transform = `translateX(${e.clientX}px)`
+      }
+    }
+    window.addEventListener('mousemove', onMove)
+    return () => window.removeEventListener('mousemove', onMove)
+  }, [])
+
+  return (
+    <div
+      ref={dotRef}
+      style={{
+        position: 'fixed',
+        top: 0,
+        left: 0,
+        width: 8,
+        height: 8,
+        background: 'black',
+        transform: 'translateX(0px)',
+      }}
+    />
+  )
+}
+```
+
+---
+
+## 6. Rendering Performance
+
+**Impact: MEDIUM**
+
+Optimizing the rendering process reduces the work the browser needs to do.
+
+### 6.1 Animate SVG Wrapper Instead of SVG Element
+
+**Impact: LOW (enables hardware acceleration)**
+
+Many browsers don't have hardware acceleration for CSS3 animations on SVG elements. Wrap SVG in a `<div>` and animate the wrapper instead.
+
+**Incorrect: animating SVG directly - no hardware acceleration**
+
+```tsx
+function LoadingSpinner() {
+  return (
+    <svg 
+      className="animate-spin"
+      width="24" 
+      height="24" 
+      viewBox="0 0 24 24"
+    >
+      <circle cx="12" cy="12" r="10" stroke="currentColor" />
+    </svg>
+  )
+}
+```
+
+**Correct: animating wrapper div - hardware accelerated**
+
+```tsx
+function LoadingSpinner() {
+  return (
+    <div className="animate-spin">
+      <svg 
+        width="24" 
+        height="24" 
+        viewBox="0 0 24 24"
+      >
+        <circle cx="12" cy="12" r="10" stroke="currentColor" />
+      </svg>
+    </div>
+  )
+}
+```
+
+This applies to all CSS transforms and transitions (`transform`, `opacity`, `translate`, `scale`, `rotate`). The wrapper div allows browsers to use GPU acceleration for smoother animations.
+
+### 6.2 CSS content-visibility for Long Lists
+
+**Impact: HIGH (faster initial render)**
+
+Apply `content-visibility: auto` to defer off-screen rendering.
+
+**CSS:**
+
+```css
+.message-item {
+  content-visibility: auto;
+  contain-intrinsic-size: 0 80px;
+}
+```
+
+**Example:**
+
+```tsx
+function MessageList({ messages }: { messages: Message[] }) {
+  return (
+    <div className="overflow-y-auto h-screen">
+      {messages.map(msg => (
+        <div key={msg.id} className="message-item">
+          <Avatar user={msg.author} />
+          <div>{msg.content}</div>
+        </div>
+      ))}
+    </div>
+  )
+}
+```
+
+For 1000 messages, the browser skips layout and paint for the ~990 off-screen items, which can make the initial render roughly 10× faster.
+
+### 6.3 Hoist Static JSX Elements
+
+**Impact: LOW (avoids re-creation)**
+
+Extract static JSX outside components to avoid re-creation.
+
+**Incorrect: recreates element every render**
+
+```tsx
+function LoadingSkeleton() {
+  return <div className="animate-pulse h-20 bg-gray-200" />
+}
+
+function Container({ loading }: { loading: boolean }) {
+  return (
+    <div>
+      {loading && <LoadingSkeleton />}
+    </div>
+  )
+}
+```
+
+**Correct: reuses same element**
+
+```tsx
+const loadingSkeleton = (
+  <div className="animate-pulse h-20 bg-gray-200" />
+)
+
+function Container({ loading }: { loading: boolean }) {
+  return (
+    <div>
+      {loading && loadingSkeleton}
+    </div>
+  )
+}
+```
+
+This is especially helpful for large and static SVG nodes, which can be expensive to recreate on every render.
+
+**Note:** If your project has [React Compiler](https://react.dev/learn/react-compiler) enabled, the compiler automatically hoists static JSX elements and optimizes component re-renders, making manual hoisting unnecessary.
+
+### 6.4 Optimize SVG Precision
+
+**Impact: LOW (reduces file size)**
+
+Reduce SVG coordinate precision to decrease file size. The optimal precision depends on the viewBox size, but in general reducing precision should be considered.
+
+**Incorrect: excessive precision**
+
+```svg
+<path d="M 10.293847 20.847362 L 30.938472 40.192837" />
+```
+
+**Correct: 1 decimal place**
+
+```svg
+<path d="M 10.3 20.8 L 30.9 40.2" />
+```
+
+**Automate with SVGO:**
+
+```bash
+npx svgo --precision=1 --multipass icon.svg
+```
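+
+To make the rounding step concrete, here is a minimal sketch of what precision reduction does to path data. The helper name `roundPathData` is hypothetical, and the sketch only handles plain decimal numbers; SVGO covers many more cases and remains the tool to use in practice.
+
+```typescript
+// Rounds every decimal number in an SVG path `d` string to `precision` places.
+// Illustrative only: exponent notation and other edge cases are not handled.
+function roundPathData(d: string, precision = 1): string {
+  return d.replace(/-?\d*\.\d+/g, match =>
+    // Number(...) drops trailing zeros, e.g. "10.0" becomes "10"
+    String(Number(Number(match).toFixed(precision)))
+  )
+}
+
+roundPathData('M 10.293847 20.847362 L 30.938472 40.192837')
+// → "M 10.3 20.8 L 30.9 40.2"
+```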
+
+### 6.5 Prevent Hydration Mismatch Without Flickering
+
+**Impact: MEDIUM (avoids visual flicker and hydration errors)**
+
+When rendering content that depends on client-side storage (localStorage, cookies), avoid both SSR breakage and post-hydration flickering by injecting a synchronous script that updates the DOM before React hydrates.
+
+**Incorrect: breaks SSR**
+
+```tsx
+function ThemeWrapper({ children }: { children: ReactNode }) {
+  // localStorage is not available on server - throws error
+  const theme = localStorage.getItem('theme') || 'light'
+  
+  return (
+    <div className={theme}>
+      {children}
+    </div>
+  )
+}
+```
+
+Server-side rendering will fail because `localStorage` is undefined.
+
+**Incorrect: visual flickering**
+
+```tsx
+function ThemeWrapper({ children }: { children: ReactNode }) {
+  const [theme, setTheme] = useState('light')
+  
+  useEffect(() => {
+    // Runs after hydration - causes visible flash
+    const stored = localStorage.getItem('theme')
+    if (stored) {
+      setTheme(stored)
+    }
+  }, [])
+  
+  return (
+    <div className={theme}>
+      {children}
+    </div>
+  )
+}
+```
+
+Component first renders with default value (`light`), then updates after hydration, causing a visible flash of incorrect content.
+
+**Correct: no flicker, no hydration mismatch**
+
+```tsx
+function ThemeWrapper({ children }: { children: ReactNode }) {
+  return (
+    <>
+      <div id="theme-wrapper">
+        {children}
+      </div>
+      <script
+        dangerouslySetInnerHTML={{
+          __html: `
+            (function() {
+              try {
+                var theme = localStorage.getItem('theme') || 'light';
+                var el = document.getElementById('theme-wrapper');
+                if (el) el.className = theme;
+              } catch (e) {}
+            })();
+          `,
+        }}
+      />
+    </>
+  )
+}
+```
+
+The inline script runs synchronously as soon as the browser parses it, before the first paint and before React hydrates, so the DOM already has the correct value. No flickering, no hydration mismatch.
+
+This pattern is especially useful for theme toggles, user preferences, authentication states, and any client-only data that should render immediately without flashing default values.
+
+### 6.6 Suppress Expected Hydration Mismatches
+
+**Impact: LOW-MEDIUM (avoids noisy hydration warnings for known differences)**
+
+In SSR frameworks (e.g., Next.js), some values are intentionally different on server vs client (random IDs, dates, locale/timezone formatting). For these *expected* mismatches, wrap the dynamic text in an element with `suppressHydrationWarning` to prevent noisy warnings. Do not use this to hide real bugs. Don’t overuse it.
+
+**Incorrect: known mismatch warnings**
+
+```tsx
+function Timestamp() {
+  return <span>{new Date().toLocaleString()}</span>
+}
+```
+
+**Correct: suppress expected mismatch only**
+
+```tsx
+function Timestamp() {
+  return (
+    <span suppressHydrationWarning>
+      {new Date().toLocaleString()}
+    </span>
+  )
+}
+```
+
+### 6.7 Use Activity Component for Show/Hide
+
+**Impact: MEDIUM (preserves state/DOM)**
+
+Use React's `<Activity>` to preserve state/DOM for expensive components that frequently toggle visibility.
+
+**Usage:**
+
+```tsx
+import { Activity } from 'react'
+
+function Dropdown({ isOpen }: Props) {
+  return (
+    <Activity mode={isOpen ? 'visible' : 'hidden'}>
+      <ExpensiveMenu />
+    </Activity>
+  )
+}
+```
+
+Avoids expensive re-renders and state loss.
+
+### 6.8 Use defer or async on Script Tags
+
+**Impact: HIGH (eliminates render-blocking)**
+
+Script tags without `defer` or `async` block HTML parsing while the script downloads and executes. This delays First Contentful Paint and Time to Interactive.
+
+- **`defer`**: Downloads in parallel, executes after HTML parsing completes, maintains execution order
+
+- **`async`**: Downloads in parallel, executes immediately when ready, no guaranteed order
+
+Use `defer` for scripts that depend on DOM or other scripts. Use `async` for independent scripts like analytics.
+
+**Incorrect: blocks rendering**
+
+```tsx
+export default function Document() {
+  return (
+    <html>
+      <head>
+        <script src="https://example.com/analytics.js" />
+        <script src="/scripts/utils.js" />
+      </head>
+      <body>{/* content */}</body>
+    </html>
+  )
+}
+```
+
+**Correct: non-blocking**
+
+```tsx
+import Script from 'next/script'
+
+export default function Page() {
+  return (
+    <>
+      <Script src="https://example.com/analytics.js" strategy="afterInteractive" />
+      <Script src="/scripts/utils.js" strategy="beforeInteractive" />
+    </>
+  )
+}
+```
+
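+**Note:** The example above uses the Next.js `next/script` component with its `strategy` prop, which is preferred over raw script tags in Next.js. Outside Next.js, add `defer` or `async` directly to each `<script>` tag.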
+
+Reference: [https://developer.mozilla.org/en-US/docs/Web/HTML/Element/script#defer](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/script#defer)
+
+### 6.9 Use Explicit Conditional Rendering
+
+**Impact: LOW (prevents rendering 0 or NaN)**
+
+Use explicit ternary operators (`? :`) instead of `&&` for conditional rendering when the condition can be `0`, `NaN`, or other falsy values that render.
+
+**Incorrect: renders "0" when count is 0**
+
+```tsx
+function Badge({ count }: { count: number }) {
+  return (
+    <div>
+      {count && <span className="badge">{count}</span>}
+    </div>
+  )
+}
+
+// When count = 0, renders: <div>0</div>
+// When count = 5, renders: <div><span class="badge">5</span></div>
+```
+
+**Correct: renders nothing when count is 0**
+
+```tsx
+function Badge({ count }: { count: number }) {
+  return (
+    <div>
+      {count > 0 ? <span className="badge">{count}</span> : null}
+    </div>
+  )
+}
+
+// When count = 0, renders: <div></div>
+// When count = 5, renders: <div><span class="badge">5</span></div>
+```
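+
+The underlying cause is plain JavaScript semantics: `a && b` evaluates to `a` itself when `a` is falsy, and React renders numbers such as `0` and `NaN` as text while skipping `null`, `undefined`, and booleans. The sketch below shows the values each expression produces, using strings in place of JSX so it runs without React; the function names are illustrative.
+
+```typescript
+// `count && x` yields `count` itself when it is 0 or NaN.
+function badgeWithAnd(count: number): string | number {
+  return count && `badge:${count}`
+}
+
+// An explicit comparison always yields either the content or null.
+function badgeExplicit(count: number): string | null {
+  return count > 0 ? `badge:${count}` : null
+}
+
+badgeWithAnd(0)  // 0 (React would render "0")
+badgeExplicit(0) // null (React renders nothing)
+```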
+
+### 6.10 Use React DOM Resource Hints
+
+**Impact: HIGH (reduces load time for critical resources)**
+
+React DOM provides APIs to hint the browser about resources it will need. These are especially useful in server components to start loading resources before the client even receives the HTML.
+
+- **`prefetchDNS(href)`**: Resolve DNS for a domain you expect to connect to
+
+- **`preconnect(href)`**: Establish connection (DNS + TCP + TLS) to a server
+
+- **`preload(href, options)`**: Fetch a resource (stylesheet, font, script, image) you'll use soon
+
+- **`preloadModule(href)`**: Fetch an ES module you'll use soon
+
+- **`preinit(href, options)`**: Fetch and evaluate a stylesheet or script
+
+- **`preinitModule(href)`**: Fetch and evaluate an ES module
+
+**Example: preconnect to third-party APIs**
+
+```tsx
+import { preconnect, prefetchDNS } from 'react-dom'
+
+export default function App() {
+  prefetchDNS('https://analytics.example.com')
+  preconnect('https://api.example.com')
+
+  return <main>{/* content */}</main>
+}
+```
+
+**Example: preload critical fonts and styles**
+
+```tsx
+import { preload, preinit } from 'react-dom'
+
+export default function RootLayout({ children }: { children: ReactNode }) {
+  // Preload font file
+  preload('/fonts/inter.woff2', { as: 'font', type: 'font/woff2', crossOrigin: 'anonymous' })
+
+  // Fetch and apply critical stylesheet immediately
+  preinit('/styles/critical.css', { as: 'style' })
+
+  return (
+    <html>
+      <body>{children}</body>
+    </html>
+  )
+}
+```
+
+**Example: preload modules for code-split routes**
+
+```tsx
+import { preloadModule } from 'react-dom'
+
+function Navigation() {
+  const preloadDashboard = () => {
+    preloadModule('/dashboard.js', { as: 'script' })
+  }
+
+  return (
+    <nav>
+      <a href="/dashboard" onMouseEnter={preloadDashboard}>
+        Dashboard
+      </a>
+    </nav>
+  )
+}
+```
+
+**When to use each:**
+
+| API | Use case |
+|-----|----------|
+| `prefetchDNS` | Third-party domains you'll connect to later |
+| `preconnect` | APIs or CDNs you'll fetch from immediately |
+| `preload` | Critical resources needed for current page |
+| `preloadModule` | JS modules for likely next navigation |
+| `preinit` | Stylesheets/scripts that must execute early |
+| `preinitModule` | ES modules that must execute early |
+
+Reference: [https://react.dev/reference/react-dom#resource-preloading-apis](https://react.dev/reference/react-dom#resource-preloading-apis)
+
+### 6.11 Use useTransition Over Manual Loading States
+
+**Impact: LOW (reduces re-renders and improves code clarity)**
+
+Use `useTransition` instead of manual `useState` for loading states. This provides built-in `isPending` state and automatically manages transitions.
+
+**Incorrect: manual loading state**
+
+```tsx
+function SearchResults() {
+  const [query, setQuery] = useState('')
+  const [results, setResults] = useState([])
+  const [isLoading, setIsLoading] = useState(false)
+
+  const handleSearch = async (value: string) => {
+    setIsLoading(true)
+    setQuery(value)
+    const data = await fetchResults(value)
+    setResults(data)
+    setIsLoading(false)
+  }
+
+  return (
+    <>
+      <input onChange={(e) => handleSearch(e.target.value)} />
+      {isLoading && <Spinner />}
+      <ResultsList results={results} />
+    </>
+  )
+}
+```
+
+**Correct: useTransition with built-in pending state**
+
+```tsx
+import { useTransition, useState } from 'react'
+
+function SearchResults() {
+  const [query, setQuery] = useState('')
+  const [results, setResults] = useState([])
+  const [isPending, startTransition] = useTransition()
+
+  const handleSearch = (value: string) => {
+    setQuery(value) // Update input immediately
+    
+    startTransition(async () => {
+      // Fetch and update results
+      const data = await fetchResults(value)
+      setResults(data)
+    })
+  }
+
+  return (
+    <>
+      <input onChange={(e) => handleSearch(e.target.value)} />
+      {isPending && <Spinner />}
+      <ResultsList results={results} />
+    </>
+  )
+}
+```
+
+**Benefits:**
+
+- **Automatic pending state**: No need to manually manage `setIsLoading(true/false)`
+
+- **Error resilience**: Pending state correctly resets even if the transition throws
+
+- **Better responsiveness**: Keeps the UI responsive during updates
+
+- **Interrupt handling**: New transitions automatically cancel pending ones
+
+Reference: [https://react.dev/reference/react/useTransition](https://react.dev/reference/react/useTransition)
+
+---
+
+## 7. JavaScript Performance
+
+**Impact: LOW-MEDIUM**
+
+Micro-optimizations for hot paths can add up to meaningful improvements.
+
+### 7.1 Avoid Layout Thrashing
+
+**Impact: MEDIUM (prevents forced synchronous layouts and reduces performance bottlenecks)**
+
+Avoid interleaving style writes with layout reads. When you read a layout property (like `offsetWidth`, `getBoundingClientRect()`, or `getComputedStyle()`) between style changes, the browser is forced to trigger a synchronous reflow.
+
+**This is OK: browser batches style changes**
+
+```typescript
+function updateElementStyles(element: HTMLElement) {
+  // Each line invalidates style, but browser batches the recalculation
+  element.style.width = '100px'
+  element.style.height = '200px'
+  element.style.backgroundColor = 'blue'
+  element.style.border = '1px solid black'
+}
+```
+
+**Incorrect: interleaved reads and writes force reflows**
+
+```typescript
+function layoutThrashing(element: HTMLElement) {
+  element.style.width = '100px'
+  const width = element.offsetWidth  // Forces reflow
+  element.style.height = '200px'
+  const height = element.offsetHeight  // Forces another reflow
+}
+```
+
+**Correct: batch writes, then read once**
+
+```typescript
+function updateElementStyles(element: HTMLElement) {
+  // Batch all writes together
+  element.style.width = '100px'
+  element.style.height = '200px'
+  element.style.backgroundColor = 'blue'
+  element.style.border = '1px solid black'
+  
+  // Read after all writes are done (single reflow)
+  const { width, height } = element.getBoundingClientRect()
+}
+```
+
+**Better: use CSS classes**
+
+```typescript
+function updateElementStyles(element: HTMLElement) {
+  // A single class toggle applies all styles at once
+  // (assumes a `.highlighted-box` rule exists in your stylesheet)
+  element.classList.add('highlighted-box')
+  
+  const { width, height } = element.getBoundingClientRect()
+}
+```
+
+**React example:**
+
+```tsx
+// Incorrect: interleaving style changes with layout queries
+function Box({ isHighlighted }: { isHighlighted: boolean }) {
+  const ref = useRef<HTMLDivElement>(null)
+  
+  useEffect(() => {
+    if (ref.current && isHighlighted) {
+      ref.current.style.width = '100px'
+      const width = ref.current.offsetWidth // Forces layout
+      ref.current.style.height = '200px'
+    }
+  }, [isHighlighted])
+  
+  return <div ref={ref}>Content</div>
+}
+
+// Correct: toggle class
+function Box({ isHighlighted }: { isHighlighted: boolean }) {
+  return (
+    <div className={isHighlighted ? 'highlighted-box' : ''}>
+      Content
+    </div>
+  )
+}
+```
+
+Prefer CSS classes over inline styles when possible. CSS files are cached by the browser, and classes provide better separation of concerns and are easier to maintain.
+
+See [this gist](https://gist.github.com/paulirish/5d52fb081b3570c81e3a) and [CSS Triggers](https://csstriggers.com/) for more information on layout-forcing operations.
+
+### 7.2 Build Index Maps for Repeated Lookups
+
+**Impact: LOW-MEDIUM (1M ops to 2K ops)**
+
+Multiple `.find()` calls by the same key should use a Map.
+
+**Incorrect (O(n) per lookup):**
+
+```typescript
+function processOrders(orders: Order[], users: User[]) {
+  return orders.map(order => ({
+    ...order,
+    user: users.find(u => u.id === order.userId)
+  }))
+}
+```
+
+**Correct (O(1) per lookup):**
+
+```typescript
+function processOrders(orders: Order[], users: User[]) {
+  const userById = new Map(users.map(u => [u.id, u]))
+
+  return orders.map(order => ({
+    ...order,
+    user: userById.get(order.userId)
+  }))
+}
+```
+
+Build map once (O(n)), then all lookups are O(1).
+
+For 1000 orders × 1000 users: 1M ops → 2K ops.
+
+### 7.3 Cache Property Access in Loops
+
+**Impact: LOW-MEDIUM (reduces lookups)**
+
+Cache object property lookups in hot paths.
+
+**Incorrect: 3 lookups × N iterations**
+
+```typescript
+for (let i = 0; i < arr.length; i++) {
+  process(obj.config.settings.value)
+}
+```
+
+**Correct: 1 lookup total**
+
+```typescript
+const value = obj.config.settings.value
+const len = arr.length
+for (let i = 0; i < len; i++) {
+  process(value)
+}
+```
+
+### 7.4 Cache Repeated Function Calls
+
+**Impact: MEDIUM (avoid redundant computation)**
+
+Use a module-level Map to cache function results when the same function is called repeatedly with the same inputs during render.
+
+**Incorrect: redundant computation**
+
+```tsx
+function ProjectList({ projects }: { projects: Project[] }) {
+  return (
+    <div>
+      {projects.map(project => {
+        // slugify() called 100+ times for same project names
+        const slug = slugify(project.name)
+        
+        return <ProjectCard key={project.id} slug={slug} />
+      })}
+    </div>
+  )
+}
+```
+
+**Correct: cached results**
+
+```tsx
+// Module-level cache
+const slugifyCache = new Map<string, string>()
+
+function cachedSlugify(text: string): string {
+  if (slugifyCache.has(text)) {
+    return slugifyCache.get(text)!
+  }
+  const result = slugify(text)
+  slugifyCache.set(text, result)
+  return result
+}
+
+function ProjectList({ projects }: { projects: Project[] }) {
+  return (
+    <div>
+      {projects.map(project => {
+        // Computed only once per unique project name
+        const slug = cachedSlugify(project.name)
+        
+        return <ProjectCard key={project.id} slug={slug} />
+      })}
+    </div>
+  )
+}
+```
+
+**Simpler pattern for single-value functions:**
+
+```typescript
+let isLoggedInCache: boolean | null = null
+
+function isLoggedIn(): boolean {
+  if (isLoggedInCache !== null) {
+    return isLoggedInCache
+  }
+  
+  isLoggedInCache = document.cookie.includes('auth=')
+  return isLoggedInCache
+}
+
+// Clear cache when auth changes
+function onAuthChange() {
+  isLoggedInCache = null
+}
+```
+
+Use a Map (not a hook) so it works everywhere: utilities, event handlers, not just React components.
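+
+When several functions need the same treatment, the Map pattern can be factored into a small wrapper. This is a minimal sketch under stated assumptions: `memoize` is a hypothetical helper (not a library API), the wrapped function is pure and takes one argument, and the `slugify` here is a stand-in defined only for the demonstration.
+
+```typescript
+// Wraps a single-argument pure function with a Map-backed cache.
+function memoize<K, V>(fn: (key: K) => V): (key: K) => V {
+  const cache = new Map<K, V>()
+  return (key: K): V => {
+    if (cache.has(key)) return cache.get(key) as V
+    const result = fn(key)
+    cache.set(key, result)
+    return result
+  }
+}
+
+// Stand-in for a real slugify implementation; counts its invocations.
+let calls = 0
+const slugify = (text: string): string => {
+  calls++
+  return text.trim().toLowerCase().replace(/\s+/g, '-')
+}
+
+const cachedSlugify = memoize(slugify)
+cachedSlugify('My Project') // computed: "my-project"
+cachedSlugify('My Project') // served from the cache; `calls` stays at 1
+```
+
+The cache in this sketch grows without bound, so it suits inputs drawn from a small, stable set; a long-lived process with many distinct keys would need eviction.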
+
+Reference: [https://vercel.com/blog/how-we-made-the-vercel-dashboard-twice-as-fast](https://vercel.com/blog/how-we-made-the-vercel-dashboard-twice-as-fast)
+
+### 7.5 Cache Storage API Calls
+
+**Impact: LOW-MEDIUM (reduces expensive I/O)**
+
+`localStorage`, `sessionStorage`, and `document.cookie` are synchronous and expensive. Cache reads in memory.
+
+**Incorrect: reads storage on every call**
+
+```typescript
+function getTheme() {
+  return localStorage.getItem('theme') ?? 'light'
+}
+// Called 10 times = 10 storage reads
+```
+
+**Correct: Map cache**
+
+```typescript
+const storageCache = new Map<string, string | null>()
+
+function getLocalStorage(key: string) {
+  if (!storageCache.has(key)) {
+    storageCache.set(key, localStorage.getItem(key))
+  }
+  return storageCache.get(key)
+}
+
+function setLocalStorage(key: string, value: string) {
+  localStorage.setItem(key, value)
+  storageCache.set(key, value)  // keep cache in sync
+}
+```
+
+Use a Map (not a hook) so it works everywhere: utilities, event handlers, not just React components.
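+
+The same idea can be written against an injected backend, which keeps the cache usable for both `localStorage` and `sessionStorage` and lets it run outside a browser (for example in unit tests). This is a sketch, not a prescribed API: `KeyValueStore` and `createStorageCache` are hypothetical names.
+
+```typescript
+// Minimal subset of the Web Storage interface used by this sketch.
+interface KeyValueStore {
+  getItem(key: string): string | null
+  setItem(key: string, value: string): void
+}
+
+// Returns cached accessors for any Storage-like backend
+// (pass `localStorage` or `sessionStorage` in the browser).
+function createStorageCache(store: KeyValueStore) {
+  const cache = new Map<string, string | null>()
+  return {
+    get(key: string): string | null {
+      if (!cache.has(key)) cache.set(key, store.getItem(key))
+      return cache.get(key) ?? null
+    },
+    set(key: string, value: string): void {
+      store.setItem(key, value)
+      cache.set(key, value) // keep cache in sync
+    },
+    // Drops one key, or the whole cache when no key is given.
+    invalidate(key?: string): void {
+      if (key === undefined) cache.clear()
+      else cache.delete(key)
+    },
+  }
+}
+```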
+
+**Cookie caching:**
+
+```typescript
+let cookieCache: Record<string, string> | null = null
+
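+function getCookie(name: string): string | undefined {
+  if (!cookieCache) {
+    cookieCache = Object.fromEntries(
+      document.cookie
+        .split('; ')
+        .filter(Boolean) // skip the empty entry when no cookies are set
+        .map(c => {
+          // Split on the first '=' only, since values may contain '='
+          const i = c.indexOf('=')
+          return i === -1 ? [c, ''] : [c.slice(0, i), c.slice(i + 1)]
+        })
+    )
+  }
+  return cookieCache[name]
+}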
+```
+
+**Important: invalidate on external changes**
+
+If storage can change externally (another tab, server-set cookies), invalidate the cache:
+
+```typescript
+window.addEventListener('storage', (e) => {
+  if (e.key) storageCache.delete(e.key)
+})
+
+document.addEventListener('visibilitychange', () => {
+  if (document.visibilityState === 'visible') {
+    storageCache.clear()
+  }
+})
+```
+
+### 7.6 Combine Multiple Array Iterations
+
+**Impact: LOW-MEDIUM (reduces iterations)**
+
+Multiple `.filter()` or `.map()` calls iterate the array multiple times. Combine into one loop.
+
+**Incorrect: 3 iterations**
+
+```typescript
+const admins = users.filter(u => u.isAdmin)
+const testers = users.filter(u => u.isTester)
+const inactive = users.filter(u => !u.isActive)
+```
+
+**Correct: 1 iteration**
+
+```typescript
+const admins: User[] = []
+const testers: User[] = []
+const inactive: User[] = []
+
+for (const user of users) {
+  if (user.isAdmin) admins.push(user)
+  if (user.isTester) testers.push(user)
+  if (!user.isActive) inactive.push(user)
+}
+```
+
+### 7.7 Early Length Check for Array Comparisons
+
+**Impact: MEDIUM-HIGH (avoids expensive operations when lengths differ)**
+
+When comparing arrays with expensive operations (sorting, deep equality, serialization), check lengths first. If lengths differ, the arrays cannot be equal.
+
+In real-world applications, this optimization is especially valuable when the comparison runs in hot paths (event handlers, render loops).
+
+**Incorrect: always runs expensive comparison**
+
+```typescript
+function hasChanges(current: string[], original: string[]) {
+  // Always sorts and joins, even when lengths differ
+  return current.sort().join() !== original.sort().join()
+}
+```
+
+Two O(n log n) sorts run even when `current.length` is 5 and `original.length` is 100. There is also overhead of joining the arrays and comparing the strings.
+
+**Correct (O(1) length check first):**
+
+```typescript
+function hasChanges(current: string[], original: string[]) {
+  // Early return if lengths differ
+  if (current.length !== original.length) {
+    return true
+  }
+  // Only sort when lengths match
+  const currentSorted = current.toSorted()
+  const originalSorted = original.toSorted()
+  for (let i = 0; i < currentSorted.length; i++) {
+    if (currentSorted[i] !== originalSorted[i]) {
+      return true
+    }
+  }
+  return false
+}
+```
+
+This new approach is more efficient because:
+
+- It avoids the overhead of sorting and joining the arrays when lengths differ
+
+- It avoids consuming memory for the joined strings (especially important for large arrays)
+
+- It avoids mutating the original arrays
+
+- It returns early when a difference is found
+
+### 7.8 Early Return from Functions
+
+**Impact: LOW-MEDIUM (avoids unnecessary computation)**
+
+Return early once the result is determined to skip unnecessary processing.
+
+**Incorrect: processes all items even after finding answer**
+
+```typescript
+function validateUsers(users: User[]) {
+  let hasError = false
+  let errorMessage = ''
+  
+  for (const user of users) {
+    if (!user.email) {
+      hasError = true
+      errorMessage = 'Email required'
+    }
+    if (!user.name) {
+      hasError = true
+      errorMessage = 'Name required'
+    }
+    // Continues checking all users even after error found
+  }
+  
+  return hasError ? { valid: false, error: errorMessage } : { valid: true }
+}
+```
+
+**Correct: returns immediately on first error**
+
+```typescript
+function validateUsers(users: User[]) {
+  for (const user of users) {
+    if (!user.email) {
+      return { valid: false, error: 'Email required' }
+    }
+    if (!user.name) {
+      return { valid: false, error: 'Name required' }
+    }
+  }
+
+  return { valid: true }
+}
+```
+
+### 7.9 Hoist RegExp Creation
+
+**Impact: LOW-MEDIUM (avoids recreation)**
+
+Don't create RegExp inside render. Hoist to module scope or memoize with `useMemo()`.
+
+**Incorrect: new RegExp every render**
+
+```tsx
+function Highlighter({ text, query }: Props) {
+  const regex = new RegExp(`(${query})`, 'gi')
+  const parts = text.split(regex)
+  return <>{parts.map((part, i) => ...)}</>
+}
+```
+
+**Correct: memoize or hoist**
+
+```tsx
+const EMAIL_REGEX = /^[^\s@]+@[^\s@]+\.[^\s@]+$/
+
+function Highlighter({ text, query }: Props) {
+  const regex = useMemo(
+    () => new RegExp(`(${escapeRegex(query)})`, 'gi'),
+    [query]
+  )
+  const parts = text.split(regex)
+  return <>{parts.map((part, i) => ...)}</>
+}
+```
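+
+The correct example calls `escapeRegex`, which is not defined in the snippet. A minimal sketch of such a helper, assuming it should make user input match literally (the character class follows the widely used escaping pattern for regex metacharacters):
+
+```typescript
+// Escapes characters that have special meaning in a regular expression so
+// that user input is matched literally.
+function escapeRegex(text: string): string {
+  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')
+}
+
+const regex = new RegExp(`(${escapeRegex('a.b')})`, 'gi')
+'xa.bx aXb'.split(regex) // ["x", "a.b", "x aXb"]
+```
+
+Without escaping, a query such as `(` would throw when passed to `new RegExp`, and `.` would match any character.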
+
+**Warning: global regex has mutable state**
+
+Global regex (`/g`) has mutable `lastIndex` state:
+
+```typescript
+const regex = /foo/g
+regex.test('foo')  // true, lastIndex = 3
+regex.test('foo')  // false, lastIndex = 0
+```
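+
+This matters most for hoisted regexes, because a module-level `/g` regex shares its `lastIndex` across every call to `test()` or `exec()`. Here is a minimal sketch of two ways to sidestep the problem; the names `GLOBAL_FOO`, `FOO`, and `containsFoo` are illustrative.
+
+```typescript
+// A hoisted /g regex remembers where the previous match ended.
+const GLOBAL_FOO = /foo/g
+const first = GLOBAL_FOO.test('foo')  // true, lastIndex is now 3
+const second = GLOBAL_FOO.test('foo') // false, the search started at index 3
+
+// Option 1: omit the `g` flag when only a yes/no answer is needed.
+const FOO = /foo/
+const stable = FOO.test('foo') && FOO.test('foo') // true
+
+// Option 2: reset lastIndex before each use of a shared global regex.
+function containsFoo(text: string): boolean {
+  GLOBAL_FOO.lastIndex = 0
+  return GLOBAL_FOO.test(text)
+}
+```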
+
+### 7.10 Use flatMap to Map and Filter in One Pass
+
+**Impact: LOW-MEDIUM (eliminates intermediate array)**
+
+Chaining `.map().filter(Boolean)` creates an intermediate array and iterates twice. Use `.flatMap()` to transform and filter in a single pass.
+
+**Incorrect: 2 iterations, intermediate array**
+
+```typescript
+const userNames = users
+  .map(user => user.isActive ? user.name : null)
+  .filter(Boolean)
+```
+
+**Correct: 1 iteration, no intermediate array**
+
+```typescript
+const userNames = users.flatMap(user =>
+  user.isActive ? [user.name] : []
+)
+```
+
+**More examples:**
+
+```typescript
+// Extract valid emails from responses
+// Before
+const emails = responses
+  .map(r => r.success ? r.data.email : null)
+  .filter(Boolean)
+
+// After
+const emails = responses.flatMap(r =>
+  r.success ? [r.data.email] : []
+)
+
+// Parse and filter valid numbers
+// Before
+const numbers = strings
+  .map(s => parseInt(s, 10))
+  .filter(n => !isNaN(n))
+
+// After
+const numbers = strings.flatMap(s => {
+  const n = parseInt(s, 10)
+  return isNaN(n) ? [] : [n]
+})
+```
+
+**When to use:**
+
+- Transforming items while filtering some out
+
+- Conditional mapping where some inputs produce no output
+
+- Parsing/validating where invalid inputs should be skipped
+
+### 7.11 Use Loop for Min/Max Instead of Sort
+
+**Impact: LOW (O(n) instead of O(n log n))**
+
+Finding the smallest or largest element only requires a single pass through the array. Sorting is wasteful and slower.
+
+**Incorrect (O(n log n) - sort to find latest):**
+
+```typescript
+interface Project {
+  id: string
+  name: string
+  updatedAt: number
+}
+
+function getLatestProject(projects: Project[]) {
+  const sorted = [...projects].sort((a, b) => b.updatedAt - a.updatedAt)
+  return sorted[0]
+}
+```
+
+Sorts the entire array just to find the maximum value.
+
+**Incorrect (O(n log n) - sort for oldest and newest):**
+
+```typescript
+function getOldestAndNewest(projects: Project[]) {
+  const sorted = [...projects].sort((a, b) => a.updatedAt - b.updatedAt)
+  return { oldest: sorted[0], newest: sorted[sorted.length - 1] }
+}
+```
+
+Still sorts unnecessarily when only min/max are needed.
+
+**Correct (O(n) - single loop):**
+
+```typescript
+function getLatestProject(projects: Project[]) {
+  if (projects.length === 0) return null
+  
+  let latest = projects[0]
+  
+  for (let i = 1; i < projects.length; i++) {
+    if (projects[i].updatedAt > latest.updatedAt) {
+      latest = projects[i]
+    }
+  }
+  
+  return latest
+}
+
+function getOldestAndNewest(projects: Project[]) {
+  if (projects.length === 0) return { oldest: null, newest: null }
+  
+  let oldest = projects[0]
+  let newest = projects[0]
+  
+  for (let i = 1; i < projects.length; i++) {
+    if (projects[i].updatedAt < oldest.updatedAt) oldest = projects[i]
+    if (projects[i].updatedAt > newest.updatedAt) newest = projects[i]
+  }
+  
+  return { oldest, newest }
+}
+```
+
+Single pass through the array, no copying, no sorting.
+
+**Alternative: Math.min/Math.max for small arrays**
+
+```typescript
+const numbers = [5, 2, 8, 1, 9]
+const min = Math.min(...numbers)
+const max = Math.max(...numbers)
+```
+
+This works for small arrays, but for very large arrays it can be slower or throw an error, because spreading passes every element as a separate function argument. The maximum array length is approximately 124000 in Chrome 143 and 638000 in Safari 18; exact numbers may vary (see [the fiddle](https://jsfiddle.net/qw1jabsx/4/)). Use the loop approach for reliability.
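+
+For plain numeric arrays, the loop approach could look like the sketch below. The `minMax` helper is a hypothetical name, not part of the original rule; it returns `null` for an empty array so callers have to handle that case explicitly.
+
+```typescript
+// Hypothetical helper: single-pass min/max for numbers. It never spreads the
+// array into function arguments, so array size is not limited by argument count.
+function minMax(values: readonly number[]): { min: number; max: number } | null {
+  if (values.length === 0) return null
+
+  let min = values[0]
+  let max = values[0]
+
+  for (let i = 1; i < values.length; i++) {
+    const v = values[i]
+    if (v < min) min = v
+    if (v > max) max = v
+  }
+
+  return { min, max }
+}
+
+// One million elements, well past the spread limits mentioned above
+const big = Array.from({ length: 1_000_000 }, (_, i) => i)
+const range = minMax(big)
+```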
+
+### 7.12 Use Set/Map for O(1) Lookups
+
+**Impact: LOW-MEDIUM (O(n) to O(1))**
+
+Convert arrays to Set/Map for repeated membership checks.
+
+**Incorrect (O(n) per check):**
+
+```typescript
+const allowedIds = ['a', 'b', 'c', ...]
+items.filter(item => allowedIds.includes(item.id))
+```
+
+**Correct (O(1) per check):**
+
+```typescript
+const allowedIds = new Set(['a', 'b', 'c', ...])
+items.filter(item => allowedIds.has(item.id))
+```
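+
+The same idea extends to `Map` when each lookup needs the matching record rather than a yes/no answer. A minimal sketch, assuming hypothetical `User` and `Order` shapes and an `attachUserNames` helper that do not come from the original rule:
+
+```typescript
+interface User { id: string; name: string }
+interface Order { id: string; userId: string }
+
+// Build the index once (O(n)); each later lookup is O(1) on average,
+// instead of calling users.find() for every order.
+function attachUserNames(orders: Order[], users: User[]) {
+  const usersById = new Map(users.map(u => [u.id, u] as const))
+  return orders.map(order => ({
+    ...order,
+    // null when the order references an unknown user
+    userName: usersById.get(order.userId)?.name ?? null
+  }))
+}
+```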
+
+### 7.13 Use toSorted() Instead of sort() for Immutability
+
+**Impact: MEDIUM-HIGH (prevents mutation bugs in React state)**
+
+`.sort()` mutates the array in place, which can cause bugs with React state and props. Use `.toSorted()` to create a new sorted array without mutation.
+
+**Incorrect: mutates original array**
+
+```typescript
+function UserList({ users }: { users: User[] }) {
+  // Mutates the users prop array!
+  const sorted = useMemo(
+    () => users.sort((a, b) => a.name.localeCompare(b.name)),
+    [users]
+  )
+  return <div>{sorted.map(renderUser)}</div>
+}
+```
+
+**Correct: creates new array**
+
+```typescript
+function UserList({ users }: { users: User[] }) {
+  // Creates new sorted array, original unchanged
+  const sorted = useMemo(
+    () => users.toSorted((a, b) => a.name.localeCompare(b.name)),
+    [users]
+  )
+  return <div>{sorted.map(renderUser)}</div>
+}
+```
+
+**Why this matters in React:**
+
+1. Props/state mutations break React's immutability model - React expects props and state to be treated as read-only
+
+2. Causes stale closure bugs - Mutating arrays inside closures (callbacks, effects) can lead to unexpected behavior
+
+**Browser support: fallback for older browsers**
+
+`.toSorted()` is available in all modern browsers (Chrome 110+, Safari 16+, Firefox 115+, Node.js 20+). For older environments, copy the array with the spread operator before sorting:
+
+```typescript
+// Fallback for older browsers
+const sorted = [...items].sort((a, b) => a.value - b.value)
+```
+
+**Other immutable array methods:**
+
+- `.toSorted()` - immutable sort
+
+- `.toReversed()` - immutable reverse
+
+- `.toSpliced()` - immutable splice
+
+- `.with()` - immutable element replacement
+
+---
+
+## 8. Advanced Patterns
+
+**Impact: LOW**
+
+Advanced patterns for specific cases that require careful implementation.
+
+### 8.1 Initialize App Once, Not Per Mount
+
+**Impact: LOW-MEDIUM (avoids duplicate init in development)**
+
+Do not put app-wide initialization that must run once per app load inside `useEffect([])` of a component. Components can remount and effects will re-run. Use a module-level guard or top-level init in the entry module instead.
+
+**Incorrect: runs twice in dev, re-runs on remount**
+
+```tsx
+function Comp() {
+  useEffect(() => {
+    loadFromStorage()
+    checkAuthToken()
+  }, [])
+
+  // ...
+}
+```
+
+**Correct: once per app load**
+
+```tsx
+let didInit = false
+
+function Comp() {
+  useEffect(() => {
+    if (didInit) return
+    didInit = true
+    loadFromStorage()
+    checkAuthToken()
+  }, [])
+
+  // ...
+}
+```
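+
+The guard can also be factored into a small wrapper so that initialization can be called from the entry module rather than from an effect. A minimal sketch: `once` and `initApp` are hypothetical names, and the two counter-based functions only stand in for the real `loadFromStorage` / `checkAuthToken`.
+
+```typescript
+// Hypothetical helper: runs `fn` on the first call only and returns the
+// cached result on every later call.
+function once<T>(fn: () => T): () => T {
+  let state: { value: T } | null = null
+  return () => {
+    if (state === null) state = { value: fn() }
+    return state.value
+  }
+}
+
+// Stand-ins for the real initialization functions (assumptions for this sketch)
+let storageLoads = 0
+let authChecks = 0
+const loadFromStorage = () => { storageLoads++ }
+const checkAuthToken = () => { authChecks++ }
+
+const initApp = once(() => {
+  loadFromStorage()
+  checkAuthToken()
+})
+
+initApp()
+initApp() // no-op: both functions have already run
+```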
+
+Reference: [https://react.dev/learn/you-might-not-need-an-effect#initializing-the-application](https://react.dev/learn/you-might-not-need-an-effect#initializing-the-application)
+
+### 8.2 Store Event Handlers in Refs
+
+**Impact: LOW (stable subscriptions)**
+
+Store callbacks in refs when used in effects that shouldn't re-subscribe on callback changes.
+
+**Incorrect: re-subscribes on every render**
+
+```tsx
+function useWindowEvent(event: string, handler: (e: Event) => void) {
+  useEffect(() => {
+    window.addEventListener(event, handler)
+    return () => window.removeEventListener(event, handler)
+  }, [event, handler])
+}
+```
+
+**Correct: stable subscription**
+
+```tsx
+import { useEffectEvent } from 'react'
+
+function useWindowEvent(event: string, handler: (e: Event) => void) {
+  const onEvent = useEffectEvent(handler)
+
+  useEffect(() => {
+    window.addEventListener(event, onEvent)
+    return () => window.removeEventListener(event, onEvent)
+  }, [event])
+}
+```
+
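+**Note: `useEffectEvent` requires a recent React version**
+
+`useEffectEvent` creates a stable function reference that always calls the latest version of the handler. On older React versions, the same pattern can be built by storing the handler in a ref (`useRef`), updating `ref.current` in an effect, and calling `ref.current` from a listener that is registered once.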
+
+### 8.3 useEffectEvent for Stable Callback Refs
+
+**Impact: LOW (prevents effect re-runs)**
+
+Access latest values in callbacks without adding them to dependency arrays. Prevents effect re-runs while avoiding stale closures.
+
+**Incorrect: effect re-runs on every callback change**
+
+```tsx
+function SearchInput({ onSearch }: { onSearch: (q: string) => void }) {
+  const [query, setQuery] = useState('')
+
+  useEffect(() => {
+    const timeout = setTimeout(() => onSearch(query), 300)
+    return () => clearTimeout(timeout)
+  }, [query, onSearch])
+}
+```
+
+**Correct: using React's useEffectEvent**
+
+```tsx
+import { useEffectEvent } from 'react';
+
+function SearchInput({ onSearch }: { onSearch: (q: string) => void }) {
+  const [query, setQuery] = useState('')
+  const onSearchEvent = useEffectEvent(onSearch)
+
+  useEffect(() => {
+    const timeout = setTimeout(() => onSearchEvent(query), 300)
+    return () => clearTimeout(timeout)
+  }, [query])
+}
+```
+
+---
+
+## References
+
+1. [https://react.dev](https://react.dev)
+2. [https://nextjs.org](https://nextjs.org)
+3. [https://swr.vercel.app](https://swr.vercel.app)
+4. [https://github.com/shuding/better-all](https://github.com/shuding/better-all)
+5. [https://github.com/isaacs/node-lru-cache](https://github.com/isaacs/node-lru-cache)
+6. [https://vercel.com/blog/how-we-optimized-package-imports-in-next-js](https://vercel.com/blog/how-we-optimized-package-imports-in-next-js)
+7. [https://vercel.com/blog/how-we-made-the-vercel-dashboard-twice-as-fast](https://vercel.com/blog/how-we-made-the-vercel-dashboard-twice-as-fast)
diff --git a/.github/skills/react-best-practices/APPLICABILITY.md b/.github/skills/react-best-practices/APPLICABILITY.md
new file mode 100644
index 0000000..838fc88
--- /dev/null
+++ b/.github/skills/react-best-practices/APPLICABILITY.md
@@ -0,0 +1,46 @@
+# React Best Practices Applicability for Ground Truth Curator
+
+Use this file as the repo-specific override for `.github/skills/react-best-practices/SKILL.md`.
+
+The upstream skill mixes framework-agnostic React guidance with Next.js- and SWR-specific guidance. This repository's `frontend/` app is a Vite + React 19 + TypeScript application, so agents should use the matrix below before applying rules mechanically.
+
+## Repo Facts
+
+- Frontend stack: Vite, React 19, TypeScript, Tailwind via Vite plugin.
+- Frontend data access uses `fetch` / `openapi-fetch` helpers in `frontend/src/api/` and `frontend/src/services/`.
+- Frontend validation uses Biome, Vitest, and `npm run typecheck`.
+- Backend routes and server workflows live in FastAPI under `backend/app/api/v1/`, not in Next.js route handlers.
+
+## Status Legend
+
+- **Applies**: Use the rule normally in this repo.
+- **Applies with adaptation**: Keep the principle, but translate framework-specific examples to Vite/React patterns.
+- **Conditional**: Use only when the affected rendering mode or architecture is actually present.
+- **Not applicable**: Do not apply this rule in the current repo stack.
+- **Reference only**: Keep as background context; do not introduce the referenced library or framework pattern without an explicit stack change.
+
+## Rule Family Matrix
+
+| Rule family | Status | Repo-specific guidance |
+| --- | --- | --- |
+| `async-*` | Applies with adaptation | Use `async-parallel`, `async-defer-await`, and `async-dependencies` for client orchestration. Treat `async-api-routes` as not applicable, and use `async-suspense-boundaries` only when the component tree already uses Suspense intentionally. |
+| `bundle-*` | Applies with adaptation | Keep direct imports, conditional loading, and defer-heavy-third-party guidance. Translate `bundle-dynamic-imports` to `React.lazy()` or dynamic `import()` patterns instead of `next/dynamic`. |
+| `server-*` | Not applicable | These rules target Next.js or server-rendered React patterns. For this repo, frontend work should not apply them; backend server work belongs in FastAPI and follows backend architecture rules instead. |
+| `client-*` | Applies with adaptation | Event-listener and localStorage guidance applies. Treat `client-swr-dedup` as reference only because the repo uses `fetch` / `openapi-fetch`, not SWR. |
+| `rerender-*` | Applies | These rules are generally useful for React component structure, state subscriptions, and avoiding unnecessary work. |
+| `rendering-*` | Conditional | DOM/rendering rules usually apply, but hydration- and resource-hint-specific guidance depends on the actual rendering path. Use only when the component or page architecture warrants it. |
+| `js-*` | Applies | General JavaScript hot-path guidance applies when it improves measured performance without harming readability. |
+| `advanced-*` | Conditional | Use only for real hotspots or stable abstractions. Avoid adding complexity speculatively. |
+
+## Common Translations
+
+- `next/dynamic` -> `React.lazy()` with `Suspense`, route-level code splitting, or gated `import()`.
+- Next.js API routes, server actions, `after()`, and RSC/server-only caching -> not applicable in `frontend/`.
+- SWR -> keep using the repo's existing `fetch` / `openapi-fetch` helpers unless SWR is explicitly added.
+- Server serialization and cross-request caching guidance -> reference only for architectural thinking, not direct frontend implementation.
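+
+The gated `import()` translation above can be sketched as follows. `lazyModule`, `renderChart`, and the `./HeavyChart` path are hypothetical names used only for illustration; none of them is an existing helper or module in this repository.
+
+```typescript
+// Hypothetical helper: wraps a loader so the underlying import() runs at most
+// once, and only when the feature is first used.
+function lazyModule<T>(loader: () => Promise<T>): () => Promise<T> {
+  let pending: Promise<T> | null = null
+  return () => {
+    if (pending === null) pending = loader()
+    return pending
+  }
+}
+
+// Usage (assumed module path and export):
+// const loadChart = lazyModule(() => import('./HeavyChart'))
+// button.addEventListener('click', async () => {
+//   const { renderChart } = await loadChart()
+//   renderChart()
+// })
+```
+
+In this sketch a failed load stays cached; resetting `pending` on rejection would be needed to allow retries.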
+
+## When In Doubt
+
+- Prefer framework-agnostic React, rendering, re-render, bundle, and JavaScript rules first.
+- If a rule example imports from `next/*` or assumes server components, treat it as not applicable unless the stack changes.
+- If a rule assumes a library that is not already in `frontend/package.json`, treat it as reference only until that dependency is intentionally adopted.
diff --git a/.github/skills/react-best-practices/README.md b/.github/skills/react-best-practices/README.md
new file mode 100644
index 0000000..ff173ca
--- /dev/null
+++ b/.github/skills/react-best-practices/README.md
@@ -0,0 +1,27 @@
+# React Best Practices
+
+This directory is a vendored copy of Vercel's React best-practices skill, kept in this repository for agent and developer use.
+
+## What is in this vendored copy
+
+- `SKILL.md` - the main skill entrypoint agents should follow.
+- `AGENTS.md` - the compiled guidance document included with the vendored copy.
+- `APPLICABILITY.md` - the **repo-specific override layer** for this repository's Vite + React frontend; use it to translate or suppress upstream Next.js- and SWR-specific guidance.
+- `rules/` - the vendored upstream rule source files included for reference.
+- `metadata.json` - upstream metadata and attribution.
+
+## How to use it here
+
+- Use this skill when working on React performance, rendering, bundle, and component-structure changes in `frontend/`.
+- Read `APPLICABILITY.md` alongside `SKILL.md` before applying rules mechanically.
+- Prefer framework-agnostic React guidance first; treat Next.js- or SWR-specific rules as conditional, adapted, or not applicable based on the applicability matrix.
+
+## Maintenance note
+
+This repository contains a repo-facing vendored copy, not the full upstream authoring/build workspace. Do not expect local regeneration assets such as `src/`, `pnpm` build scripts, or generated `test-cases.json` in this folder.
+
+If this skill needs to be refreshed, use the upstream Vercel-authored source as the source of truth for regeneration, then re-apply this repository's `APPLICABILITY.md` guidance.
+
+## Attribution
+
+Originally created by [@shuding](https://x.com/shuding) at [Vercel](https://vercel.com).
diff --git a/.github/skills/react-best-practices/SKILL.md b/.github/skills/react-best-practices/SKILL.md
new file mode 100644
index 0000000..de73301
--- /dev/null
+++ b/.github/skills/react-best-practices/SKILL.md
@@ -0,0 +1,147 @@
+---
+name: vercel-react-best-practices
+description: React performance optimization guidelines from Vercel Engineering, adapted for this repository's Vite React frontend. Use for React component, data-fetching, bundle, and rendering performance work; treat embedded Next.js and SWR guidance as conditional/reference-only unless those technologies are adopted here.
+license: MIT
+metadata:
+  author: vercel
+  version: "1.0.0"
+---
+
+# Vercel React Best Practices
+
+Comprehensive performance optimization guide for React and Next.js applications, maintained by Vercel. Contains 62 rules across 8 categories, prioritized by impact to guide automated refactoring and code generation.
+
+> **Repository applicability note:** This repository uses a Vite React frontend, not Next.js.
+> - Framework-agnostic React performance guidance applies normally.
+> - Next.js-specific rules (for example API routes, server actions, `next/dynamic`, and `after()`) should be treated as not applicable unless this repository adopts Next.js.
+> - SWR-specific rules are reference-only unless SWR is added to this repository.
+> - See `APPLICABILITY.md` for the repo-curated rule matrix and Vite/React substitutions.
+
+## When to Apply
+
+Reference these guidelines when:
+- Writing new React components or Next.js pages
+- Implementing data fetching (client or server-side)
+- Reviewing code for performance issues
+- Refactoring existing React/Next.js code
+- Optimizing bundle size or load times
+
+## Rule Categories by Priority
+
+| Priority | Category | Impact | Prefix |
+|----------|----------|--------|--------|
+| 1 | Eliminating Waterfalls | CRITICAL | `async-` |
+| 2 | Bundle Size Optimization | CRITICAL | `bundle-` |
+| 3 | Server-Side Performance | HIGH | `server-` |
+| 4 | Client-Side Data Fetching | MEDIUM-HIGH | `client-` |
+| 5 | Re-render Optimization | MEDIUM | `rerender-` |
+| 6 | Rendering Performance | MEDIUM | `rendering-` |
+| 7 | JavaScript Performance | LOW-MEDIUM | `js-` |
+| 8 | Advanced Patterns | LOW | `advanced-` |
+
+## Quick Reference
+
+### 1. Eliminating Waterfalls (CRITICAL)
+
+- `async-defer-await` - Move await into branches where actually used
+- `async-parallel` - Use Promise.all() for independent operations
+- `async-dependencies` - Use better-all for partial dependencies
+- `async-api-routes` - Start promises early, await late in API routes
+- `async-suspense-boundaries` - Use Suspense to stream content
+
+### 2. Bundle Size Optimization (CRITICAL)
+
+- `bundle-barrel-imports` - Import directly, avoid barrel files
+- `bundle-dynamic-imports` - Use next/dynamic for heavy components
+- `bundle-defer-third-party` - Load analytics/logging after hydration
+- `bundle-conditional` - Load modules only when feature is activated
+- `bundle-preload` - Preload on hover/focus for perceived speed
+
+### 3. Server-Side Performance (HIGH)
+
+- `server-auth-actions` - Authenticate server actions like API routes
+- `server-cache-react` - Use React.cache() for per-request deduplication
+- `server-cache-lru` - Use LRU cache for cross-request caching
+- `server-dedup-props` - Avoid duplicate serialization in RSC props
+- `server-hoist-static-io` - Hoist static I/O (fonts, logos) to module level
+- `server-serialization` - Minimize data passed to client components
+- `server-parallel-fetching` - Restructure components to parallelize fetches
+- `server-after-nonblocking` - Use after() for non-blocking operations
+
+### 4. Client-Side Data Fetching (MEDIUM-HIGH)
+
+- `client-swr-dedup` - Use SWR for automatic request deduplication
+- `client-event-listeners` - Deduplicate global event listeners
+- `client-passive-event-listeners` - Use passive listeners for scroll
+- `client-localstorage-schema` - Version and minimize localStorage data
+
+### 5. Re-render Optimization (MEDIUM)
+
+- `rerender-defer-reads` - Don't subscribe to state only used in callbacks
+- `rerender-memo` - Extract expensive work into memoized components
+- `rerender-memo-with-default-value` - Hoist default non-primitive props
+- `rerender-dependencies` - Use primitive dependencies in effects
+- `rerender-derived-state` - Subscribe to derived booleans, not raw values
+- `rerender-derived-state-no-effect` - Derive state during render, not effects
+- `rerender-functional-setstate` - Use functional setState for stable callbacks
+- `rerender-lazy-state-init` - Pass function to useState for expensive values
+- `rerender-simple-expression-in-memo` - Avoid memo for simple primitives
+- `rerender-move-effect-to-event` - Put interaction logic in event handlers
+- `rerender-transitions` - Use startTransition for non-urgent updates
+- `rerender-use-ref-transient-values` - Use refs for transient frequent values
+- `rerender-no-inline-components` - Don't define components inside components
+
+### 6. Rendering Performance (MEDIUM)
+
+- `rendering-animate-svg-wrapper` - Animate div wrapper, not SVG element
+- `rendering-content-visibility` - Use content-visibility for long lists
+- `rendering-hoist-jsx` - Extract static JSX outside components
+- `rendering-svg-precision` - Reduce SVG coordinate precision
+- `rendering-hydration-no-flicker` - Use inline script for client-only data
+- `rendering-hydration-suppress-warning` - Suppress expected mismatches
+- `rendering-activity` - Use Activity component for show/hide
+- `rendering-conditional-render` - Use ternary, not && for conditionals
+- `rendering-usetransition-loading` - Prefer useTransition for loading state
+- `rendering-resource-hints` - Use React DOM resource hints for preloading
+- `rendering-script-defer-async` - Use defer or async on script tags
+
+### 7. JavaScript Performance (LOW-MEDIUM)
+
+- `js-batch-dom-css` - Group CSS changes via classes or cssText
+- `js-index-maps` - Build Map for repeated lookups
+- `js-cache-property-access` - Cache object properties in loops
+- `js-cache-function-results` - Cache function results in module-level Map
+- `js-cache-storage` - Cache localStorage/sessionStorage reads
+- `js-combine-iterations` - Combine multiple filter/map into one loop
+- `js-length-check-first` - Check array length before expensive comparison
+- `js-early-exit` - Return early from functions
+- `js-hoist-regexp` - Hoist RegExp creation outside loops
+- `js-min-max-loop` - Use loop for min/max instead of sort
+- `js-set-map-lookups` - Use Set/Map for O(1) lookups
+- `js-tosorted-immutable` - Use toSorted() for immutability
+- `js-flatmap-filter` - Use flatMap to map and filter in one pass
+
+### 8. Advanced Patterns (LOW)
+
+- `advanced-event-handler-refs` - Store event handlers in refs
+- `advanced-init-once` - Initialize app once per app load
+- `advanced-use-latest` - useEffectEvent for stable callback refs
+
+## How to Use
+
+Read individual rule files for detailed explanations and code examples:
+
+```
+rules/async-parallel.md
+rules/bundle-barrel-imports.md
+```
+
+Each rule file contains:
+- Brief explanation of why it matters
+- Incorrect code example with explanation
+- Correct code example with explanation
+- Additional context and references
+
+## Full Compiled Document
+
+For the complete guide with all rules expanded: `AGENTS.md`
diff --git a/.github/skills/react-best-practices/metadata.json b/.github/skills/react-best-practices/metadata.json
new file mode 100644
index 0000000..3bec38b
--- /dev/null
+++ b/.github/skills/react-best-practices/metadata.json
@@ -0,0 +1,15 @@
+{
+  "version": "1.0.0",
+  "organization": "Vercel Engineering",
+  "date": "January 2026",
+  "abstract": "Comprehensive performance optimization guide for React and Next.js applications, designed for AI agents and LLMs. Contains 40+ rules across 8 categories, prioritized by impact from critical (eliminating waterfalls, reducing bundle size) to incremental (advanced patterns). Each rule includes detailed explanations, real-world examples comparing incorrect vs. correct implementations, and specific impact metrics to guide automated refactoring and code generation.",
+  "references": [
+    "https://react.dev",
+    "https://nextjs.org",
+    "https://swr.vercel.app",
+    "https://github.com/shuding/better-all",
+    "https://github.com/isaacs/node-lru-cache",
+    "https://vercel.com/blog/how-we-optimized-package-imports-in-next-js",
+    "https://vercel.com/blog/how-we-made-the-vercel-dashboard-twice-as-fast"
+  ]
+}
diff --git a/.github/skills/react-best-practices/rules/_sections.md b/.github/skills/react-best-practices/rules/_sections.md
new file mode 100644
index 0000000..4d20c14
--- /dev/null
+++ b/.github/skills/react-best-practices/rules/_sections.md
@@ -0,0 +1,46 @@
+# Sections
+
+This file defines all sections, their ordering, impact levels, and descriptions.
+The section ID (in parentheses) is the filename prefix used to group rules.
+
+---
+
+## 1. Eliminating Waterfalls (async)
+
+**Impact:** CRITICAL  
+**Description:** Waterfalls are the #1 performance killer. Each sequential await adds full network latency. Eliminating them yields the largest gains.
+
+## 2. Bundle Size Optimization (bundle)
+
+**Impact:** CRITICAL  
+**Description:** Reducing initial bundle size improves Time to Interactive and Largest Contentful Paint.
+
+## 3. Server-Side Performance (server)
+
+**Impact:** HIGH  
+**Description:** Optimizing server-side rendering and data fetching eliminates server-side waterfalls and reduces response times.
+
+## 4. Client-Side Data Fetching (client)
+
+**Impact:** MEDIUM-HIGH  
+**Description:** Automatic deduplication and efficient data fetching patterns reduce redundant network requests.
+
+## 5. Re-render Optimization (rerender)
+
+**Impact:** MEDIUM  
+**Description:** Reducing unnecessary re-renders minimizes wasted computation and improves UI responsiveness.
+
+## 6. Rendering Performance (rendering)
+
+**Impact:** MEDIUM  
+**Description:** Optimizing the rendering process reduces the work the browser needs to do.
+
+## 7. JavaScript Performance (js)
+
+**Impact:** LOW-MEDIUM  
+**Description:** Micro-optimizations for hot paths can add up to meaningful improvements.
+
+## 8. Advanced Patterns (advanced)
+
+**Impact:** LOW  
+**Description:** Advanced patterns for specific cases that require careful implementation.
diff --git a/.github/skills/react-best-practices/rules/_template.md b/.github/skills/react-best-practices/rules/_template.md
new file mode 100644
index 0000000..1e9e707
--- /dev/null
+++ b/.github/skills/react-best-practices/rules/_template.md
@@ -0,0 +1,28 @@
+---
+title: Rule Title Here
+impact: MEDIUM
+impactDescription: Optional description of impact (e.g., "20-50% improvement")
+tags: tag1, tag2
+---
+
+## Rule Title Here
+
+**Impact: MEDIUM (optional impact description)**
+
+Brief explanation of the rule and why it matters. This should be clear and concise, explaining the performance implications.
+
+**Incorrect (description of what's wrong):**
+
+```typescript
+// Bad code example here
+const bad = example()
+```
+
+**Correct (description of what's right):**
+
+```typescript
+// Good code example here
+const good = example()
+```
+
+Reference: [Link to documentation or resource](https://example.com)
diff --git a/.github/skills/react-best-practices/rules/advanced-event-handler-refs.md b/.github/skills/react-best-practices/rules/advanced-event-handler-refs.md
new file mode 100644
index 0000000..97e7ade
--- /dev/null
+++ b/.github/skills/react-best-practices/rules/advanced-event-handler-refs.md
@@ -0,0 +1,55 @@
+---
+title: Store Event Handlers in Refs
+impact: LOW
+impactDescription: stable subscriptions
+tags: advanced, hooks, refs, event-handlers, optimization
+---
+
+## Store Event Handlers in Refs
+
+Store callbacks in refs when used in effects that shouldn't re-subscribe on callback changes.
+
+**Incorrect (re-subscribes on every render):**
+
+```tsx
+function useWindowEvent(event: string, handler: (e: Event) => void) {
+  useEffect(() => {
+    window.addEventListener(event, handler)
+    return () => window.removeEventListener(event, handler)
+  }, [event, handler])
+}
+```
+
+**Correct (stable subscription):**
+
+```tsx
+function useWindowEvent(event: string, handler: (e: Event) => void) {
+  const handlerRef = useRef(handler)
+  useEffect(() => {
+    handlerRef.current = handler
+  }, [handler])
+
+  useEffect(() => {
+    const listener = (e: Event) => handlerRef.current(e)
+    window.addEventListener(event, listener)
+    return () => window.removeEventListener(event, listener)
+  }, [event])
+}
+```
+
+**Alternative: use `useEffectEvent` if you're on latest React:**
+
+```tsx
+import { useEffectEvent } from 'react'
+
+function useWindowEvent(event: string, handler: (e: Event) => void) {
+  const onEvent = useEffectEvent(handler)
+
+  useEffect(() => {
+    window.addEventListener(event, onEvent)
+    return () => window.removeEventListener(event, onEvent)
+  }, [event])
+}
+```
+
+`useEffectEvent` provides a cleaner API for the same pattern: it creates a stable function reference that always calls the latest version of the handler.
diff --git a/.github/skills/react-best-practices/rules/advanced-init-once.md b/.github/skills/react-best-practices/rules/advanced-init-once.md
new file mode 100644
index 0000000..73ee38e
--- /dev/null
+++ b/.github/skills/react-best-practices/rules/advanced-init-once.md
@@ -0,0 +1,42 @@
+---
+title: Initialize App Once, Not Per Mount
+impact: LOW-MEDIUM
+impactDescription: avoids duplicate init in development
+tags: initialization, useEffect, app-startup, side-effects
+---
+
+## Initialize App Once, Not Per Mount
+
+Do not put app-wide initialization that must run once per app load inside `useEffect([])` of a component. Components can remount and effects will re-run. Use a module-level guard or top-level init in the entry module instead.
+
+**Incorrect (runs twice in dev, re-runs on remount):**
+
+```tsx
+function Comp() {
+  useEffect(() => {
+    loadFromStorage()
+    checkAuthToken()
+  }, [])
+
+  // ...
+}
+```
+
+**Correct (once per app load):**
+
+```tsx
+let didInit = false
+
+function Comp() {
+  useEffect(() => {
+    if (didInit) return
+    didInit = true
+    loadFromStorage()
+    checkAuthToken()
+  }, [])
+
+  // ...
+}
+```
+
+Reference: [Initializing the application](https://react.dev/learn/you-might-not-need-an-effect#initializing-the-application)
diff --git a/.github/skills/react-best-practices/rules/advanced-use-latest.md b/.github/skills/react-best-practices/rules/advanced-use-latest.md
new file mode 100644
index 0000000..9c7cb50
--- /dev/null
+++ b/.github/skills/react-best-practices/rules/advanced-use-latest.md
@@ -0,0 +1,39 @@
+---
+title: useEffectEvent for Stable Callback Refs
+impact: LOW
+impactDescription: prevents effect re-runs
+tags: advanced, hooks, useEffectEvent, refs, optimization
+---
+
+## useEffectEvent for Stable Callback Refs
+
+Access latest values in callbacks without adding them to dependency arrays. Prevents effect re-runs while avoiding stale closures.
+
+**Incorrect (effect re-runs on every callback change):**
+
+```tsx
+function SearchInput({ onSearch }: { onSearch: (q: string) => void }) {
+  const [query, setQuery] = useState('')
+
+  useEffect(() => {
+    const timeout = setTimeout(() => onSearch(query), 300)
+    return () => clearTimeout(timeout)
+  }, [query, onSearch])
+}
+```
+
+**Correct (using React's useEffectEvent):**
+
+```tsx
+import { useEffectEvent } from 'react';
+
+function SearchInput({ onSearch }: { onSearch: (q: string) => void }) {
+  const [query, setQuery] = useState('')
+  const onSearchEvent = useEffectEvent(onSearch)
+
+  useEffect(() => {
+    const timeout = setTimeout(() => onSearchEvent(query), 300)
+    return () => clearTimeout(timeout)
+  }, [query])
+}
+```
diff --git a/.github/skills/react-best-practices/rules/async-api-routes.md b/.github/skills/react-best-practices/rules/async-api-routes.md
new file mode 100644
index 0000000..6feda1e
--- /dev/null
+++ b/.github/skills/react-best-practices/rules/async-api-routes.md
@@ -0,0 +1,38 @@
+---
+title: Prevent Waterfall Chains in API Routes
+impact: CRITICAL
+impactDescription: 2-10× improvement
+tags: api-routes, server-actions, waterfalls, parallelization
+---
+
+## Prevent Waterfall Chains in API Routes
+
+In API routes and Server Actions, start independent operations immediately, even if you don't await them yet.
+
+**Incorrect (config waits for auth, data waits for both):**
+
+```typescript
+export async function GET(request: Request) {
+  const session = await auth()
+  const config = await fetchConfig()
+  const data = await fetchData(session.user.id)
+  return Response.json({ data, config })
+}
+```
+
+**Correct (auth and config start immediately):**
+
+```typescript
+export async function GET(request: Request) {
+  const sessionPromise = auth()
+  const configPromise = fetchConfig()
+  const session = await sessionPromise
+  const [config, data] = await Promise.all([
+    configPromise,
+    fetchData(session.user.id)
+  ])
+  return Response.json({ data, config })
+}
+```
+
+For operations with more complex dependency chains, use `better-all` to automatically maximize parallelism (see Dependency-Based Parallelization).
diff --git a/.github/skills/react-best-practices/rules/async-defer-await.md b/.github/skills/react-best-practices/rules/async-defer-await.md
new file mode 100644
index 0000000..ea7082a
--- /dev/null
+++ b/.github/skills/react-best-practices/rules/async-defer-await.md
@@ -0,0 +1,80 @@
+---
+title: Defer Await Until Needed
+impact: HIGH
+impactDescription: avoids blocking unused code paths
+tags: async, await, conditional, optimization
+---
+
+## Defer Await Until Needed
+
+Move `await` operations into the branches where they're actually used to avoid blocking code paths that don't need them.
+
+**Incorrect (blocks both branches):**
+
+```typescript
+async function handleRequest(userId: string, skipProcessing: boolean) {
+  const userData = await fetchUserData(userId)
+  
+  if (skipProcessing) {
+    // Returns immediately but still waited for userData
+    return { skipped: true }
+  }
+  
+  // Only this branch uses userData
+  return processUserData(userData)
+}
+```
+
+**Correct (only blocks when needed):**
+
+```typescript
+async function handleRequest(userId: string, skipProcessing: boolean) {
+  if (skipProcessing) {
+    // Returns immediately without waiting
+    return { skipped: true }
+  }
+  
+  // Fetch only when needed
+  const userData = await fetchUserData(userId)
+  return processUserData(userData)
+}
+```
+
+**Another example (early return optimization):**
+
+```typescript
+// Incorrect: always fetches permissions
+async function updateResource(resourceId: string, userId: string) {
+  const permissions = await fetchPermissions(userId)
+  const resource = await getResource(resourceId)
+  
+  if (!resource) {
+    return { error: 'Not found' }
+  }
+  
+  if (!permissions.canEdit) {
+    return { error: 'Forbidden' }
+  }
+  
+  return await updateResourceData(resource, permissions)
+}
+
+// Correct: fetches only when needed
+async function updateResource(resourceId: string, userId: string) {
+  const resource = await getResource(resourceId)
+  
+  if (!resource) {
+    return { error: 'Not found' }
+  }
+  
+  const permissions = await fetchPermissions(userId)
+  
+  if (!permissions.canEdit) {
+    return { error: 'Forbidden' }
+  }
+  
+  return await updateResourceData(resource, permissions)
+}
+```
+
+This optimization is especially valuable when the skipped branch is frequently taken, or when the deferred operation is expensive.
diff --git a/.github/skills/react-best-practices/rules/async-dependencies.md b/.github/skills/react-best-practices/rules/async-dependencies.md
new file mode 100644
index 0000000..0484eba
--- /dev/null
+++ b/.github/skills/react-best-practices/rules/async-dependencies.md
@@ -0,0 +1,51 @@
+---
+title: Dependency-Based Parallelization
+impact: CRITICAL
+impactDescription: 2-10× improvement
+tags: async, parallelization, dependencies, better-all
+---
+
+## Dependency-Based Parallelization
+
+For operations with partial dependencies, use `better-all` to maximize parallelism. It automatically starts each task at the earliest possible moment.
+
+**Incorrect (profile waits for config unnecessarily):**
+
+```typescript
+const [user, config] = await Promise.all([
+  fetchUser(),
+  fetchConfig()
+])
+const profile = await fetchProfile(user.id)
+```
+
+**Correct (config and profile run in parallel):**
+
+```typescript
+import { all } from 'better-all'
+
+const { user, config, profile } = await all({
+  async user() { return fetchUser() },
+  async config() { return fetchConfig() },
+  async profile() {
+    return fetchProfile((await this.$.user).id)
+  }
+})
+```
+
+**Alternative without extra dependencies:**
+
+Alternatively, create all the promises first and call `Promise.all()` at the end.
+
+```typescript
+const userPromise = fetchUser()
+const profilePromise = userPromise.then(user => fetchProfile(user.id))
+
+const [user, config, profile] = await Promise.all([
+  userPromise,
+  fetchConfig(),
+  profilePromise
+])
+```
+
+Reference: [https://github.com/shuding/better-all](https://github.com/shuding/better-all)
diff --git a/.github/skills/react-best-practices/rules/async-parallel.md b/.github/skills/react-best-practices/rules/async-parallel.md
new file mode 100644
index 0000000..64133f6
--- /dev/null
+++ b/.github/skills/react-best-practices/rules/async-parallel.md
@@ -0,0 +1,28 @@
+---
+title: Promise.all() for Independent Operations
+impact: CRITICAL
+impactDescription: 2-10× improvement
+tags: async, parallelization, promises, waterfalls
+---
+
+## Promise.all() for Independent Operations
+
+When async operations have no interdependencies, execute them concurrently using `Promise.all()`.
+
+**Incorrect (sequential execution, 3 round trips):**
+
+```typescript
+const user = await fetchUser()
+const posts = await fetchPosts()
+const comments = await fetchComments()
+```
+
+**Correct (parallel execution, 1 round trip):**
+
+```typescript
+const [user, posts, comments] = await Promise.all([
+  fetchUser(),
+  fetchPosts(),
+  fetchComments()
+])
+```
diff --git a/.github/skills/react-best-practices/rules/async-suspense-boundaries.md b/.github/skills/react-best-practices/rules/async-suspense-boundaries.md
new file mode 100644
index 0000000..1fbc05b
--- /dev/null
+++ b/.github/skills/react-best-practices/rules/async-suspense-boundaries.md
@@ -0,0 +1,99 @@
+---
+title: Strategic Suspense Boundaries
+impact: HIGH
+impactDescription: faster initial paint
+tags: async, suspense, streaming, layout-shift
+---
+
+## Strategic Suspense Boundaries
+
+Instead of awaiting data in async components before returning JSX, use Suspense boundaries to show the wrapper UI faster while data loads.
+
+**Incorrect (wrapper blocked by data fetching):**
+
+```tsx
+async function Page() {
+  const data = await fetchData() // Blocks entire page
+  
+  return (
+    <div>
+      <div>Sidebar</div>
+      <div>Header</div>
+      <div>
+        <DataDisplay data={data} />
+      </div>
+      <div>Footer</div>
+    </div>
+  )
+}
+```
+
+The entire layout waits for data even though only the middle section needs it.
+
+**Correct (wrapper shows immediately, data streams in):**
+
+```tsx
+function Page() {
+  return (
+    <div>
+      <div>Sidebar</div>
+      <div>Header</div>
+      <div>
+        <Suspense fallback={<Skeleton />}>
+          <DataDisplay />
+        </Suspense>
+      </div>
+      <div>Footer</div>
+    </div>
+  )
+}
+
+async function DataDisplay() {
+  const data = await fetchData() // Only blocks this component
+  return <div>{data.content}</div>
+}
+```
+
+Sidebar, Header, and Footer render immediately. Only DataDisplay waits for data.
+
+**Alternative (share promise across components):**
+
+```tsx
+function Page() {
+  // Start fetch immediately, but don't await
+  const dataPromise = fetchData()
+  
+  return (
+    <div>
+      <div>Sidebar</div>
+      <div>Header</div>
+      <Suspense fallback={<Skeleton />}>
+        <DataDisplay dataPromise={dataPromise} />
+        <DataSummary dataPromise={dataPromise} />
+      </Suspense>
+      <div>Footer</div>
+    </div>
+  )
+}
+
+function DataDisplay({ dataPromise }: { dataPromise: Promise<Data> }) {
+  const data = use(dataPromise) // Unwraps the promise
+  return <div>{data.content}</div>
+}
+
+function DataSummary({ dataPromise }: { dataPromise: Promise<Data> }) {
+  const data = use(dataPromise) // Reuses the same promise
+  return <div>{data.summary}</div>
+}
+```
+
+Both components share the same promise, so only one fetch occurs. Layout renders immediately while both components wait together.
+
+**When NOT to use this pattern:**
+
+- Critical data needed for layout decisions (affects positioning)
+- SEO-critical content above the fold
+- Small, fast queries where suspense overhead isn't worth it
+- When you want to avoid layout shift (loading → content jump)
+
+**Trade-off:** Faster initial paint vs potential layout shift. Choose based on your UX priorities.
diff --git a/.github/skills/react-best-practices/rules/bundle-barrel-imports.md b/.github/skills/react-best-practices/rules/bundle-barrel-imports.md
new file mode 100644
index 0000000..ee48f32
--- /dev/null
+++ b/.github/skills/react-best-practices/rules/bundle-barrel-imports.md
@@ -0,0 +1,59 @@
+---
+title: Avoid Barrel File Imports
+impact: CRITICAL
+impactDescription: 200-800ms import cost, slow builds
+tags: bundle, imports, tree-shaking, barrel-files, performance
+---
+
+## Avoid Barrel File Imports
+
+Import directly from source files instead of barrel files to avoid loading thousands of unused modules. **Barrel files** are entry points that re-export multiple modules (e.g., `index.js` that does `export * from './module'`).
+
+Popular icon and component libraries can have **up to 10,000 re-exports** in their entry file. For many React packages, **it takes 200-800ms just to import them**, affecting both development speed and production cold starts.
+
+**Why tree-shaking doesn't help:** When a library is marked as external (not bundled), the bundler can't optimize it. If you bundle it to enable tree-shaking, builds become substantially slower because the bundler must analyze the entire module graph.
+
+**Incorrect (imports entire library):**
+
+```tsx
+import { Check, X, Menu } from 'lucide-react'
+// Loads 1,583 modules, takes ~2.8s extra in dev
+// Runtime cost: 200-800ms on every cold start
+
+import { Button, TextField } from '@mui/material'
+// Loads 2,225 modules, takes ~4.2s extra in dev
+```
+
+**Correct (imports only what you need):**
+
+```tsx
+import Check from 'lucide-react/dist/esm/icons/check'
+import X from 'lucide-react/dist/esm/icons/x'
+import Menu from 'lucide-react/dist/esm/icons/menu'
+// Loads only 3 modules (~2KB vs ~1MB)
+
+import Button from '@mui/material/Button'
+import TextField from '@mui/material/TextField'
+// Loads only what you use
+```
+
+**Alternative (Next.js 13.5+):**
+
+```js
+// next.config.js - use optimizePackageImports
+module.exports = {
+  experimental: {
+    optimizePackageImports: ['lucide-react', '@mui/material']
+  }
+}
+
+// Then you can keep the ergonomic barrel imports:
+import { Check, X, Menu } from 'lucide-react'
+// Automatically transformed to direct imports at build time
+```
+
+Direct imports provide 15-70% faster dev boot, 28% faster builds, 40% faster cold starts, and significantly faster HMR.
+
+Libraries commonly affected: `lucide-react`, `@mui/material`, `@mui/icons-material`, `@tabler/icons-react`, `react-icons`, `@headlessui/react`, `@radix-ui/react-*`, `lodash`, `ramda`, `date-fns`, `rxjs`, `react-use`.
+
+Reference: [How we optimized package imports in Next.js](https://vercel.com/blog/how-we-optimized-package-imports-in-next-js)
diff --git a/.github/skills/react-best-practices/rules/bundle-conditional.md b/.github/skills/react-best-practices/rules/bundle-conditional.md
new file mode 100644
index 0000000..99d6fc9
--- /dev/null
+++ b/.github/skills/react-best-practices/rules/bundle-conditional.md
@@ -0,0 +1,31 @@
+---
+title: Conditional Module Loading
+impact: HIGH
+impactDescription: loads large data only when needed
+tags: bundle, conditional-loading, lazy-loading
+---
+
+## Conditional Module Loading
+
+Load large data or modules only when a feature is activated.
+
+**Example (lazy-load animation frames):**
+
+```tsx
+function AnimationPlayer({ enabled, setEnabled }: { enabled: boolean; setEnabled: React.Dispatch<React.SetStateAction<boolean>> }) {
+  const [frames, setFrames] = useState<Frame[] | null>(null)
+
+  useEffect(() => {
+    if (enabled && !frames && typeof window !== 'undefined') {
+      import('./animation-frames.js')
+        .then(mod => setFrames(mod.frames))
+        .catch(() => setEnabled(false))
+    }
+  }, [enabled, frames, setEnabled])
+
+  if (!frames) return <Skeleton />
+  return <Canvas frames={frames} />
+}
+```
+
+The `typeof window !== 'undefined'` check prevents bundling this module for SSR, optimizing server bundle size and build speed.
diff --git a/.github/skills/react-best-practices/rules/bundle-defer-third-party.md b/.github/skills/react-best-practices/rules/bundle-defer-third-party.md
new file mode 100644
index 0000000..db041d1
--- /dev/null
+++ b/.github/skills/react-best-practices/rules/bundle-defer-third-party.md
@@ -0,0 +1,49 @@
+---
+title: Defer Non-Critical Third-Party Libraries
+impact: MEDIUM
+impactDescription: loads after hydration
+tags: bundle, third-party, analytics, defer
+---
+
+## Defer Non-Critical Third-Party Libraries
+
+Analytics, logging, and error tracking don't block user interaction. Load them after hydration.
+
+**Incorrect (blocks initial bundle):**
+
+```tsx
+import { Analytics } from '@vercel/analytics/react'
+
+export default function RootLayout({ children }) {
+  return (
+    <html>
+      <body>
+        {children}
+        <Analytics />
+      </body>
+    </html>
+  )
+}
+```
+
+**Correct (loads after hydration):**
+
+```tsx
+import dynamic from 'next/dynamic'
+
+const Analytics = dynamic(
+  () => import('@vercel/analytics/react').then(m => m.Analytics),
+  { ssr: false }
+)
+
+export default function RootLayout({ children }) {
+  return (
+    <html>
+      <body>
+        {children}
+        <Analytics />
+      </body>
+    </html>
+  )
+}
+```
diff --git a/.github/skills/react-best-practices/rules/bundle-dynamic-imports.md b/.github/skills/react-best-practices/rules/bundle-dynamic-imports.md
new file mode 100644
index 0000000..60b6269
--- /dev/null
+++ b/.github/skills/react-best-practices/rules/bundle-dynamic-imports.md
@@ -0,0 +1,35 @@
+---
+title: Dynamic Imports for Heavy Components
+impact: CRITICAL
+impactDescription: directly affects TTI and LCP
+tags: bundle, dynamic-import, code-splitting, next-dynamic
+---
+
+## Dynamic Imports for Heavy Components
+
+Use `next/dynamic` to lazy-load large components not needed on initial render.
+
+**Incorrect (Monaco bundles with main chunk ~300KB):**
+
+```tsx
+import { MonacoEditor } from './monaco-editor'
+
+function CodePanel({ code }: { code: string }) {
+  return <MonacoEditor value={code} />
+}
+```
+
+**Correct (Monaco loads on demand):**
+
+```tsx
+import dynamic from 'next/dynamic'
+
+const MonacoEditor = dynamic(
+  () => import('./monaco-editor').then(m => m.MonacoEditor),
+  { ssr: false }
+)
+
+function CodePanel({ code }: { code: string }) {
+  return <MonacoEditor value={code} />
+}
+```
diff --git a/.github/skills/react-best-practices/rules/bundle-preload.md b/.github/skills/react-best-practices/rules/bundle-preload.md
new file mode 100644
index 0000000..7000504
--- /dev/null
+++ b/.github/skills/react-best-practices/rules/bundle-preload.md
@@ -0,0 +1,50 @@
+---
+title: Preload Based on User Intent
+impact: MEDIUM
+impactDescription: reduces perceived latency
+tags: bundle, preload, user-intent, hover
+---
+
+## Preload Based on User Intent
+
+Preload heavy bundles before they're needed to reduce perceived latency.
+
+**Example (preload on hover/focus):**
+
+```tsx
+function EditorButton({ onClick }: { onClick: () => void }) {
+  const preload = () => {
+    if (typeof window !== 'undefined') {
+      void import('./monaco-editor')
+    }
+  }
+
+  return (
+    <button
+      onMouseEnter={preload}
+      onFocus={preload}
+      onClick={onClick}
+    >
+      Open Editor
+    </button>
+  )
+}
+```
+
+**Example (preload when feature flag is enabled):**
+
+```tsx
+function FlagsProvider({ children, flags }: Props) {
+  useEffect(() => {
+    if (flags.editorEnabled && typeof window !== 'undefined') {
+      void import('./monaco-editor').then(mod => mod.init())
+    }
+  }, [flags.editorEnabled])
+
+  return <FlagsContext.Provider value={flags}>
+    {children}
+  </FlagsContext.Provider>
+}
+```
+
+The `typeof window !== 'undefined'` check prevents bundling preloaded modules for SSR, optimizing server bundle size and build speed.
diff --git a/.github/skills/react-best-practices/rules/client-event-listeners.md b/.github/skills/react-best-practices/rules/client-event-listeners.md
new file mode 100644
index 0000000..aad4ae9
--- /dev/null
+++ b/.github/skills/react-best-practices/rules/client-event-listeners.md
@@ -0,0 +1,74 @@
+---
+title: Deduplicate Global Event Listeners
+impact: LOW
+impactDescription: single listener for N components
+tags: client, swr, event-listeners, subscription
+---
+
+## Deduplicate Global Event Listeners
+
+Use `useSWRSubscription()` to share global event listeners across component instances.
+
+**Incorrect (N instances = N listeners):**
+
+```tsx
+function useKeyboardShortcut(key: string, callback: () => void) {
+  useEffect(() => {
+    const handler = (e: KeyboardEvent) => {
+      if (e.metaKey && e.key === key) {
+        callback()
+      }
+    }
+    window.addEventListener('keydown', handler)
+    return () => window.removeEventListener('keydown', handler)
+  }, [key, callback])
+}
+```
+
+When using the `useKeyboardShortcut` hook multiple times, each instance will register a new listener.
+
+**Correct (N instances = 1 listener):**
+
+```tsx
+import useSWRSubscription from 'swr/subscription'
+
+// Module-level Map to track callbacks per key
+const keyCallbacks = new Map<string, Set<() => void>>()
+
+function useKeyboardShortcut(key: string, callback: () => void) {
+  // Register this callback in the Map
+  useEffect(() => {
+    if (!keyCallbacks.has(key)) {
+      keyCallbacks.set(key, new Set())
+    }
+    keyCallbacks.get(key)!.add(callback)
+
+    return () => {
+      const set = keyCallbacks.get(key)
+      if (set) {
+        set.delete(callback)
+        if (set.size === 0) {
+          keyCallbacks.delete(key)
+        }
+      }
+    }
+  }, [key, callback])
+
+  useSWRSubscription('global-keydown', () => {
+    const handler = (e: KeyboardEvent) => {
+      if (e.metaKey && keyCallbacks.has(e.key)) {
+        keyCallbacks.get(e.key)!.forEach(cb => cb())
+      }
+    }
+    window.addEventListener('keydown', handler)
+    return () => window.removeEventListener('keydown', handler)
+  })
+}
+
+function Profile() {
+  // Multiple shortcuts will share the same listener
+  useKeyboardShortcut('p', () => { /* ... */ }) 
+  useKeyboardShortcut('k', () => { /* ... */ })
+  // ...
+}
+```
diff --git a/.github/skills/react-best-practices/rules/client-localstorage-schema.md b/.github/skills/react-best-practices/rules/client-localstorage-schema.md
new file mode 100644
index 0000000..d30a1a7
--- /dev/null
+++ b/.github/skills/react-best-practices/rules/client-localstorage-schema.md
@@ -0,0 +1,71 @@
+---
+title: Version and Minimize localStorage Data
+impact: MEDIUM
+impactDescription: prevents schema conflicts, reduces storage size
+tags: client, localStorage, storage, versioning, data-minimization
+---
+
+## Version and Minimize localStorage Data
+
+Add a version identifier to keys and store only the fields the UI needs. This prevents schema conflicts and accidental storage of sensitive data.
+
+**Incorrect:**
+
+```typescript
+// No version, stores everything, no error handling
+localStorage.setItem('userConfig', JSON.stringify(fullUserObject))
+const data = localStorage.getItem('userConfig')
+```
+
+**Correct:**
+
+```typescript
+const VERSION = 'v2'
+
+function saveConfig(config: { theme: string; language: string }) {
+  try {
+    localStorage.setItem(`userConfig:${VERSION}`, JSON.stringify(config))
+  } catch {
+    // Throws in incognito/private browsing, quota exceeded, or disabled
+  }
+}
+
+function loadConfig() {
+  try {
+    const data = localStorage.getItem(`userConfig:${VERSION}`)
+    return data ? JSON.parse(data) : null
+  } catch {
+    return null
+  }
+}
+
+// Migration from v1 to v2
+function migrate() {
+  try {
+    const v1 = localStorage.getItem('userConfig:v1')
+    if (v1) {
+      const old = JSON.parse(v1)
+      saveConfig({ theme: old.darkMode ? 'dark' : 'light', language: old.lang })
+      localStorage.removeItem('userConfig:v1')
+    }
+  } catch {}
+}
+```
+
+**Store minimal fields from server responses:**
+
+```typescript
+// User object has 20+ fields, only store what UI needs
+function cachePrefs(user: FullUser) {
+  try {
+    localStorage.setItem('prefs:v1', JSON.stringify({
+      theme: user.preferences.theme,
+      notifications: user.preferences.notifications
+    }))
+  } catch {}
+}
+```
+
+**Always wrap in try-catch:** `getItem()` and `setItem()` can throw in incognito/private browsing (Safari, Firefox), when the quota is exceeded, or when storage is disabled.
+
+**Benefits:** Schema evolution via versioning, reduced storage size, prevents storing tokens/PII/internal flags.
diff --git a/.github/skills/react-best-practices/rules/client-passive-event-listeners.md b/.github/skills/react-best-practices/rules/client-passive-event-listeners.md
new file mode 100644
index 0000000..ce39a88
--- /dev/null
+++ b/.github/skills/react-best-practices/rules/client-passive-event-listeners.md
@@ -0,0 +1,48 @@
+---
+title: Use Passive Event Listeners for Scrolling Performance
+impact: MEDIUM
+impactDescription: eliminates scroll delay caused by event listeners
+tags: client, event-listeners, scrolling, performance, touch, wheel
+---
+
+## Use Passive Event Listeners for Scrolling Performance
+
+Add `{ passive: true }` to touch and wheel event listeners to enable immediate scrolling. Browsers normally wait for listeners to finish to check if `preventDefault()` is called, causing scroll delay.
+
+**Incorrect:**
+
+```typescript
+useEffect(() => {
+  const handleTouch = (e: TouchEvent) => console.log(e.touches[0].clientX)
+  const handleWheel = (e: WheelEvent) => console.log(e.deltaY)
+  
+  document.addEventListener('touchstart', handleTouch)
+  document.addEventListener('wheel', handleWheel)
+  
+  return () => {
+    document.removeEventListener('touchstart', handleTouch)
+    document.removeEventListener('wheel', handleWheel)
+  }
+}, [])
+```
+
+**Correct:**
+
+```typescript
+useEffect(() => {
+  const handleTouch = (e: TouchEvent) => console.log(e.touches[0].clientX)
+  const handleWheel = (e: WheelEvent) => console.log(e.deltaY)
+  
+  document.addEventListener('touchstart', handleTouch, { passive: true })
+  document.addEventListener('wheel', handleWheel, { passive: true })
+  
+  return () => {
+    document.removeEventListener('touchstart', handleTouch)
+    document.removeEventListener('wheel', handleWheel)
+  }
+}, [])
+```
+
+**Use passive when:** tracking/analytics, logging, any listener that doesn't call `preventDefault()`.
+
+**Don't use passive when:** implementing custom swipe gestures, custom zoom controls, or any listener that needs `preventDefault()`.
diff --git a/.github/skills/react-best-practices/rules/client-swr-dedup.md b/.github/skills/react-best-practices/rules/client-swr-dedup.md
new file mode 100644
index 0000000..2a430f2
--- /dev/null
+++ b/.github/skills/react-best-practices/rules/client-swr-dedup.md
@@ -0,0 +1,56 @@
+---
+title: Use SWR for Automatic Deduplication
+impact: MEDIUM-HIGH
+impactDescription: automatic deduplication
+tags: client, swr, deduplication, data-fetching
+---
+
+## Use SWR for Automatic Deduplication
+
+SWR enables request deduplication, caching, and revalidation across component instances.
+
+**Incorrect (no deduplication, each instance fetches):**
+
+```tsx
+function UserList() {
+  const [users, setUsers] = useState([])
+  useEffect(() => {
+    fetch('/api/users')
+      .then(r => r.json())
+      .then(setUsers)
+  }, [])
+}
+```
+
+**Correct (multiple instances share one request):**
+
+```tsx
+import useSWR from 'swr'
+
+function UserList() {
+  const { data: users } = useSWR('/api/users', fetcher)
+}
+```
+
+**For immutable data:**
+
+```tsx
+import useSWRImmutable from 'swr/immutable'
+
+function StaticContent() {
+  const { data } = useSWRImmutable('/api/config', fetcher)
+}
+```
+
+**For mutations:**
+
+```tsx
+import useSWRMutation from 'swr/mutation'
+
+function UpdateButton() {
+  const { trigger } = useSWRMutation('/api/user', updateUser)
+  return <button onClick={() => trigger()}>Update</button>
+}
+```
+
+Reference: [https://swr.vercel.app](https://swr.vercel.app)
diff --git a/.github/skills/react-best-practices/rules/js-batch-dom-css.md b/.github/skills/react-best-practices/rules/js-batch-dom-css.md
new file mode 100644
index 0000000..a62d84e
--- /dev/null
+++ b/.github/skills/react-best-practices/rules/js-batch-dom-css.md
@@ -0,0 +1,107 @@
+---
+title: Avoid Layout Thrashing
+impact: MEDIUM
+impactDescription: prevents forced synchronous layouts and reduces performance bottlenecks
+tags: javascript, dom, css, performance, reflow, layout-thrashing
+---
+
+## Avoid Layout Thrashing
+
+Avoid interleaving style writes with layout reads. When you read a layout property (like `offsetWidth`, `getBoundingClientRect()`, or `getComputedStyle()`) between style changes, the browser is forced to trigger a synchronous reflow.
+
+**This is OK (browser batches style changes):**
+```typescript
+function updateElementStyles(element: HTMLElement) {
+  // Each line invalidates style, but browser batches the recalculation
+  element.style.width = '100px'
+  element.style.height = '200px'
+  element.style.backgroundColor = 'blue'
+  element.style.border = '1px solid black'
+}
+```
+
+**Incorrect (interleaved reads and writes force reflows):**
+```typescript
+function layoutThrashing(element: HTMLElement) {
+  element.style.width = '100px'
+  const width = element.offsetWidth  // Forces reflow
+  element.style.height = '200px'
+  const height = element.offsetHeight  // Forces another reflow
+}
+```
+
+**Correct (batch writes, then read once):**
+```typescript
+function updateElementStyles(element: HTMLElement) {
+  // Batch all writes together
+  element.style.width = '100px'
+  element.style.height = '200px'
+  element.style.backgroundColor = 'blue'
+  element.style.border = '1px solid black'
+  
+  // Read after all writes are done (single reflow)
+  const { width, height } = element.getBoundingClientRect()
+}
+```
+
+**Correct (batch reads, then writes):**
+```typescript
+function avoidThrashing(element: HTMLElement) {
+  // Read phase - all layout queries first
+  const rect1 = element.getBoundingClientRect()
+  const offsetWidth = element.offsetWidth
+  const offsetHeight = element.offsetHeight
+  
+  // Write phase - all style changes after
+  element.style.width = '100px'
+  element.style.height = '200px'
+}
+```
+
+**Better: use CSS classes**
+```css
+.highlighted-box {
+  width: 100px;
+  height: 200px;
+  background-color: blue;
+  border: 1px solid black;
+}
+```
+```typescript
+function updateElementStyles(element: HTMLElement) {
+  element.classList.add('highlighted-box')
+  
+  const { width, height } = element.getBoundingClientRect()
+}
+```
+
+**React example:**
+```tsx
+// Incorrect: interleaving style changes with layout queries
+function Box({ isHighlighted }: { isHighlighted: boolean }) {
+  const ref = useRef<HTMLDivElement>(null)
+  
+  useEffect(() => {
+    if (ref.current && isHighlighted) {
+      ref.current.style.width = '100px'
+      const width = ref.current.offsetWidth // Forces layout
+      ref.current.style.height = '200px'
+    }
+  }, [isHighlighted])
+  
+  return <div ref={ref}>Content</div>
+}
+
+// Correct: toggle class
+function Box({ isHighlighted }: { isHighlighted: boolean }) {
+  return (
+    <div className={isHighlighted ? 'highlighted-box' : ''}>
+      Content
+    </div>
+  )
+}
+```
+
+Prefer CSS classes over inline styles when possible. CSS files are cached by the browser, and classes provide better separation of concerns and are easier to maintain.
+
+See [this gist](https://gist.github.com/paulirish/5d52fb081b3570c81e3a) and [CSS Triggers](https://csstriggers.com/) for more information on layout-forcing operations.
diff --git a/.github/skills/react-best-practices/rules/js-cache-function-results.md b/.github/skills/react-best-practices/rules/js-cache-function-results.md
new file mode 100644
index 0000000..180f8ac
--- /dev/null
+++ b/.github/skills/react-best-practices/rules/js-cache-function-results.md
@@ -0,0 +1,80 @@
+---
+title: Cache Repeated Function Calls
+impact: MEDIUM
+impactDescription: avoid redundant computation
+tags: javascript, cache, memoization, performance
+---
+
+## Cache Repeated Function Calls
+
+Use a module-level Map to cache function results when the same function is called repeatedly with the same inputs during render.
+
+**Incorrect (redundant computation):**
+
+```tsx
+function ProjectList({ projects }: { projects: Project[] }) {
+  return (
+    <div>
+      {projects.map(project => {
+        // slugify() called 100+ times for same project names
+        const slug = slugify(project.name)
+        
+        return <ProjectCard key={project.id} slug={slug} />
+      })}
+    </div>
+  )
+}
+```
+
+**Correct (cached results):**
+
+```tsx
+// Module-level cache
+const slugifyCache = new Map<string, string>()
+
+function cachedSlugify(text: string): string {
+  if (slugifyCache.has(text)) {
+    return slugifyCache.get(text)!
+  }
+  const result = slugify(text)
+  slugifyCache.set(text, result)
+  return result
+}
+
+function ProjectList({ projects }: { projects: Project[] }) {
+  return (
+    <div>
+      {projects.map(project => {
+        // Computed only once per unique project name
+        const slug = cachedSlugify(project.name)
+        
+        return <ProjectCard key={project.id} slug={slug} />
+      })}
+    </div>
+  )
+}
+```
+
+**Simpler pattern for single-value functions:**
+
+```typescript
+let isLoggedInCache: boolean | null = null
+
+function isLoggedIn(): boolean {
+  if (isLoggedInCache !== null) {
+    return isLoggedInCache
+  }
+  
+  isLoggedInCache = document.cookie.includes('auth=')
+  return isLoggedInCache
+}
+
+// Clear cache when auth changes
+function onAuthChange() {
+  isLoggedInCache = null
+}
+```
+
+Use a Map (not a hook) so it works everywhere: utilities, event handlers, not just React components.
+
+Reference: [How we made the Vercel Dashboard twice as fast](https://vercel.com/blog/how-we-made-the-vercel-dashboard-twice-as-fast)
diff --git a/.github/skills/react-best-practices/rules/js-cache-property-access.md b/.github/skills/react-best-practices/rules/js-cache-property-access.md
new file mode 100644
index 0000000..39eec90
--- /dev/null
+++ b/.github/skills/react-best-practices/rules/js-cache-property-access.md
@@ -0,0 +1,28 @@
+---
+title: Cache Property Access in Loops
+impact: LOW-MEDIUM
+impactDescription: reduces lookups
+tags: javascript, loops, optimization, caching
+---
+
+## Cache Property Access in Loops
+
+Cache object property lookups in hot paths.
+
+**Incorrect (3 lookups × N iterations):**
+
+```typescript
+for (let i = 0; i < arr.length; i++) {
+  process(obj.config.settings.value)
+}
+```
+
+**Correct (1 lookup total):**
+
+```typescript
+const value = obj.config.settings.value
+const len = arr.length
+for (let i = 0; i < len; i++) {
+  process(value)
+}
+```
diff --git a/.github/skills/react-best-practices/rules/js-cache-storage.md b/.github/skills/react-best-practices/rules/js-cache-storage.md
new file mode 100644
index 0000000..aa4a30c
--- /dev/null
+++ b/.github/skills/react-best-practices/rules/js-cache-storage.md
@@ -0,0 +1,70 @@
+---
+title: Cache Storage API Calls
+impact: LOW-MEDIUM
+impactDescription: reduces expensive I/O
+tags: javascript, localStorage, storage, caching, performance
+---
+
+## Cache Storage API Calls
+
+`localStorage`, `sessionStorage`, and `document.cookie` are synchronous and expensive. Cache reads in memory.
+
+**Incorrect (reads storage on every call):**
+
+```typescript
+function getTheme() {
+  return localStorage.getItem('theme') ?? 'light'
+}
+// Called 10 times = 10 storage reads
+```
+
+**Correct (Map cache):**
+
+```typescript
+const storageCache = new Map<string, string | null>()
+
+function getLocalStorage(key: string) {
+  if (!storageCache.has(key)) {
+    storageCache.set(key, localStorage.getItem(key))
+  }
+  return storageCache.get(key)
+}
+
+function setLocalStorage(key: string, value: string) {
+  localStorage.setItem(key, value)
+  storageCache.set(key, value)  // keep cache in sync
+}
+```
+
+Use a Map (not a hook) so it works everywhere: utilities, event handlers, not just React components.
+
+**Cookie caching:**
+
+```typescript
+let cookieCache: Record<string, string> | null = null
+
+function getCookie(name: string) {
+  if (!cookieCache) {
+    cookieCache = Object.fromEntries(
+      document.cookie.split('; ').map(c => c.split(/=(.*)/).slice(0, 2)) // split on the first '=' only; values may contain '='
+    )
+  }
+  return cookieCache[name]
+}
+```
+
+**Important (invalidate on external changes):**
+
+If storage can change externally (another tab, server-set cookies), invalidate the cache:
+
+```typescript
+window.addEventListener('storage', (e) => {
+  if (e.key) storageCache.delete(e.key)
+})
+
+document.addEventListener('visibilitychange', () => {
+  if (document.visibilityState === 'visible') {
+    storageCache.clear()
+  }
+})
+```
diff --git a/.github/skills/react-best-practices/rules/js-combine-iterations.md b/.github/skills/react-best-practices/rules/js-combine-iterations.md
new file mode 100644
index 0000000..044d017
--- /dev/null
+++ b/.github/skills/react-best-practices/rules/js-combine-iterations.md
@@ -0,0 +1,32 @@
+---
+title: Combine Multiple Array Iterations
+impact: LOW-MEDIUM
+impactDescription: reduces iterations
+tags: javascript, arrays, loops, performance
+---
+
+## Combine Multiple Array Iterations
+
+Multiple `.filter()` or `.map()` calls iterate the array multiple times. Combine into one loop.
+
+**Incorrect (3 iterations):**
+
+```typescript
+const admins = users.filter(u => u.isAdmin)
+const testers = users.filter(u => u.isTester)
+const inactive = users.filter(u => !u.isActive)
+```
+
+**Correct (1 iteration):**
+
+```typescript
+const admins: User[] = []
+const testers: User[] = []
+const inactive: User[] = []
+
+for (const user of users) {
+  if (user.isAdmin) admins.push(user)
+  if (user.isTester) testers.push(user)
+  if (!user.isActive) inactive.push(user)
+}
+```
diff --git a/.github/skills/react-best-practices/rules/js-early-exit.md b/.github/skills/react-best-practices/rules/js-early-exit.md
new file mode 100644
index 0000000..f46cb89
--- /dev/null
+++ b/.github/skills/react-best-practices/rules/js-early-exit.md
@@ -0,0 +1,50 @@
+---
+title: Early Return from Functions
+impact: LOW-MEDIUM
+impactDescription: avoids unnecessary computation
+tags: javascript, functions, optimization, early-return
+---
+
+## Early Return from Functions
+
+Return early once the result is determined, to skip unnecessary processing.
+
+**Incorrect (processes all items even after finding answer):**
+
+```typescript
+function validateUsers(users: User[]) {
+  let hasError = false
+  let errorMessage = ''
+  
+  for (const user of users) {
+    if (!user.email) {
+      hasError = true
+      errorMessage = 'Email required'
+    }
+    if (!user.name) {
+      hasError = true
+      errorMessage = 'Name required'
+    }
+    // Continues checking all users even after error found
+  }
+  
+  return hasError ? { valid: false, error: errorMessage } : { valid: true }
+}
+```
+
+**Correct (returns immediately on first error):**
+
+```typescript
+function validateUsers(users: User[]) {
+  for (const user of users) {
+    if (!user.email) {
+      return { valid: false, error: 'Email required' }
+    }
+    if (!user.name) {
+      return { valid: false, error: 'Name required' }
+    }
+  }
+
+  return { valid: true }
+}
+```
diff --git a/.github/skills/react-best-practices/rules/js-flatmap-filter.md b/.github/skills/react-best-practices/rules/js-flatmap-filter.md
new file mode 100644
index 0000000..ee0edf0
--- /dev/null
+++ b/.github/skills/react-best-practices/rules/js-flatmap-filter.md
@@ -0,0 +1,60 @@
+---
+title: Use flatMap to Map and Filter in One Pass
+impact: LOW-MEDIUM
+impactDescription: eliminates intermediate array
+tags: javascript, arrays, flatMap, filter, performance
+---
+
+## Use flatMap to Map and Filter in One Pass
+
+**Impact: LOW-MEDIUM (eliminates intermediate array)**
+
+Chaining `.map().filter(Boolean)` creates an intermediate array and iterates twice. Use `.flatMap()` to transform and filter in a single pass.
+
+**Incorrect (2 iterations, intermediate array):**
+
+```typescript
+const userNames = users
+  .map(user => user.isActive ? user.name : null)
+  .filter(Boolean)
+```
+
+**Correct (1 iteration, no intermediate array):**
+
+```typescript
+const userNames = users.flatMap(user =>
+  user.isActive ? [user.name] : []
+)
+```
+
+**More examples:**
+
+```typescript
+// Extract valid emails from responses
+// Before
+const emails = responses
+  .map(r => r.success ? r.data.email : null)
+  .filter(Boolean)
+
+// After
+const emails = responses.flatMap(r =>
+  r.success ? [r.data.email] : []
+)
+
+// Parse and filter valid numbers
+// Before
+const numbers = strings
+  .map(s => parseInt(s, 10))
+  .filter(n => !isNaN(n))
+
+// After
+const numbers = strings.flatMap(s => {
+  const n = parseInt(s, 10)
+  return isNaN(n) ? [] : [n]
+})
+```
+
+**When to use:**
+- Transforming items while filtering some out
+- Conditional mapping where some inputs produce no output
+- Parsing/validating where invalid inputs should be skipped
diff --git a/.github/skills/react-best-practices/rules/js-hoist-regexp.md b/.github/skills/react-best-practices/rules/js-hoist-regexp.md
new file mode 100644
index 0000000..dae3fef
--- /dev/null
+++ b/.github/skills/react-best-practices/rules/js-hoist-regexp.md
@@ -0,0 +1,45 @@
+---
+title: Hoist RegExp Creation
+impact: LOW-MEDIUM
+impactDescription: avoids recreation
+tags: javascript, regexp, optimization, memoization
+---
+
+## Hoist RegExp Creation
+
+Don't create a `RegExp` inside render. Hoist static patterns to module scope, and memoize patterns that depend on props with `useMemo()`.
+
+**Incorrect (new RegExp every render):**
+
+```tsx
+function Highlighter({ text, query }: Props) {
+  const regex = new RegExp(`(${query})`, 'gi')
+  const parts = text.split(regex)
+  return <>{parts.map((part, i) => ...)}</>
+}
+```
+
+**Correct (memoize or hoist):**
+
+```tsx
+const EMAIL_REGEX = /^[^\s@]+@[^\s@]+\.[^\s@]+$/
+
+function Highlighter({ text, query }: Props) {
+  const regex = useMemo(
+    () => new RegExp(`(${escapeRegex(query)})`, 'gi'), // escapeRegex: assumed helper that escapes regex metacharacters
+    [query]
+  )
+  const parts = text.split(regex)
+  return <>{parts.map((part, i) => ...)}</>
+}
+```
+
+**Warning (global regex has mutable state):**
+
+Global regex (`/g`) has mutable `lastIndex` state:
+
+```typescript
+const regex = /foo/g
+regex.test('foo')  // true, lastIndex = 3
+regex.test('foo')  // false, lastIndex = 0
+```
diff --git a/.github/skills/react-best-practices/rules/js-index-maps.md b/.github/skills/react-best-practices/rules/js-index-maps.md
new file mode 100644
index 0000000..9d357a0
--- /dev/null
+++ b/.github/skills/react-best-practices/rules/js-index-maps.md
@@ -0,0 +1,37 @@
+---
+title: Build Index Maps for Repeated Lookups
+impact: LOW-MEDIUM
+impactDescription: 1M ops to 2K ops
+tags: javascript, map, indexing, optimization, performance
+---
+
+## Build Index Maps for Repeated Lookups
+
+Multiple `.find()` calls by the same key should use a Map.
+
+**Incorrect (O(n) per lookup):**
+
+```typescript
+function processOrders(orders: Order[], users: User[]) {
+  return orders.map(order => ({
+    ...order,
+    user: users.find(u => u.id === order.userId)
+  }))
+}
+```
+
+**Correct (O(1) per lookup):**
+
+```typescript
+function processOrders(orders: Order[], users: User[]) {
+  const userById = new Map(users.map(u => [u.id, u]))
+
+  return orders.map(order => ({
+    ...order,
+    user: userById.get(order.userId)
+  }))
+}
+```
+
+Build map once (O(n)), then all lookups are O(1).
+For 1000 orders × 1000 users: 1M ops → 2K ops.
diff --git a/.github/skills/react-best-practices/rules/js-length-check-first.md b/.github/skills/react-best-practices/rules/js-length-check-first.md
new file mode 100644
index 0000000..8b89573
--- /dev/null
+++ b/.github/skills/react-best-practices/rules/js-length-check-first.md
@@ -0,0 +1,49 @@
+---
+title: Early Length Check for Array Comparisons
+impact: MEDIUM-HIGH
+impactDescription: avoids expensive operations when lengths differ
+tags: javascript, arrays, performance, optimization, comparison
+---
+
+## Early Length Check for Array Comparisons
+
+When comparing arrays with expensive operations (sorting, deep equality, serialization), check lengths first. If lengths differ, the arrays cannot be equal.
+
+In real-world applications, this optimization is especially valuable when the comparison runs in hot paths (event handlers, render loops).
+
+**Incorrect (always runs expensive comparison):**
+
+```typescript
+function hasChanges(current: string[], original: string[]) {
+  // Always sorts and joins, even when lengths differ
+  return current.sort().join() !== original.sort().join()
+}
+```
+
+Two O(n log n) sorts run even when `current.length` is 5 and `original.length` is 100. There is also overhead of joining the arrays and comparing the strings.
+
+**Correct (O(1) length check first):**
+
+```typescript
+function hasChanges(current: string[], original: string[]) {
+  // Early return if lengths differ
+  if (current.length !== original.length) {
+    return true
+  }
+  // Only sort when lengths match
+  const currentSorted = current.toSorted()
+  const originalSorted = original.toSorted()
+  for (let i = 0; i < currentSorted.length; i++) {
+    if (currentSorted[i] !== originalSorted[i]) {
+      return true
+    }
+  }
+  return false
+}
+```
+
+This new approach is more efficient because:
+- It avoids the overhead of sorting and joining the arrays when lengths differ
+- It avoids consuming memory for the joined strings (especially important for large arrays)
+- It avoids mutating the original arrays
+- It returns early when a difference is found
diff --git a/.github/skills/react-best-practices/rules/js-min-max-loop.md b/.github/skills/react-best-practices/rules/js-min-max-loop.md
new file mode 100644
index 0000000..4b6656e
--- /dev/null
+++ b/.github/skills/react-best-practices/rules/js-min-max-loop.md
@@ -0,0 +1,82 @@
+---
+title: Use Loop for Min/Max Instead of Sort
+impact: LOW
+impactDescription: O(n) instead of O(n log n)
+tags: javascript, arrays, performance, sorting, algorithms
+---
+
+## Use Loop for Min/Max Instead of Sort
+
+Finding the smallest or largest element only requires a single pass through the array. Sorting is wasteful and slower.
+
+**Incorrect (O(n log n) - sort to find latest):**
+
+```typescript
+interface Project {
+  id: string
+  name: string
+  updatedAt: number
+}
+
+function getLatestProject(projects: Project[]) {
+  const sorted = [...projects].sort((a, b) => b.updatedAt - a.updatedAt)
+  return sorted[0]
+}
+```
+
+Sorts the entire array just to find the maximum value.
+
+**Incorrect (O(n log n) - sort for oldest and newest):**
+
+```typescript
+function getOldestAndNewest(projects: Project[]) {
+  const sorted = [...projects].sort((a, b) => a.updatedAt - b.updatedAt)
+  return { oldest: sorted[0], newest: sorted[sorted.length - 1] }
+}
+```
+
+Still sorts unnecessarily when only min/max are needed.
+
+**Correct (O(n) - single loop):**
+
+```typescript
+function getLatestProject(projects: Project[]) {
+  if (projects.length === 0) return null
+  
+  let latest = projects[0]
+  
+  for (let i = 1; i < projects.length; i++) {
+    if (projects[i].updatedAt > latest.updatedAt) {
+      latest = projects[i]
+    }
+  }
+  
+  return latest
+}
+
+function getOldestAndNewest(projects: Project[]) {
+  if (projects.length === 0) return { oldest: null, newest: null }
+  
+  let oldest = projects[0]
+  let newest = projects[0]
+  
+  for (let i = 1; i < projects.length; i++) {
+    if (projects[i].updatedAt < oldest.updatedAt) oldest = projects[i]
+    if (projects[i].updatedAt > newest.updatedAt) newest = projects[i]
+  }
+  
+  return { oldest, newest }
+}
+```
+
+Single pass through the array, no copying, no sorting.
+
+**Alternative (Math.min/Math.max for small arrays):**
+
+```typescript
+const numbers = [5, 2, 8, 1, 9]
+const min = Math.min(...numbers)
+const max = Math.max(...numbers)
+```
+
+This works for small arrays, but for very large arrays it can be slower or throw an error, because the spread operator passes every element as a separate function argument. The maximum array length is approximately 124000 in Chrome 143 and 638000 in Safari 18; exact numbers may vary - see [the fiddle](https://jsfiddle.net/qw1jabsx/4/). Use the loop approach for reliability.
diff --git a/.github/skills/react-best-practices/rules/js-set-map-lookups.md b/.github/skills/react-best-practices/rules/js-set-map-lookups.md
new file mode 100644
index 0000000..680a489
--- /dev/null
+++ b/.github/skills/react-best-practices/rules/js-set-map-lookups.md
@@ -0,0 +1,24 @@
+---
+title: Use Set/Map for O(1) Lookups
+impact: LOW-MEDIUM
+impactDescription: O(n) to O(1)
+tags: javascript, set, map, data-structures, performance
+---
+
+## Use Set/Map for O(1) Lookups
+
+Convert arrays to Set/Map for repeated membership checks.
+
+**Incorrect (O(n) per check):**
+
+```typescript
+const allowedIds = ['a', 'b', 'c', ...]
+items.filter(item => allowedIds.includes(item.id))
+```
+
+**Correct (O(1) per check):**
+
+```typescript
+const allowedIds = new Set(['a', 'b', 'c', ...])
+items.filter(item => allowedIds.has(item.id))
+```
diff --git a/.github/skills/react-best-practices/rules/js-tosorted-immutable.md b/.github/skills/react-best-practices/rules/js-tosorted-immutable.md
new file mode 100644
index 0000000..eae8b3f
--- /dev/null
+++ b/.github/skills/react-best-practices/rules/js-tosorted-immutable.md
@@ -0,0 +1,57 @@
+---
+title: Use toSorted() Instead of sort() for Immutability
+impact: MEDIUM-HIGH
+impactDescription: prevents mutation bugs in React state
+tags: javascript, arrays, immutability, react, state, mutation
+---
+
+## Use toSorted() Instead of sort() for Immutability
+
+`.sort()` mutates the array in place, which can cause bugs with React state and props. Use `.toSorted()` to create a new sorted array without mutation.
+
+**Incorrect (mutates original array):**
+
+```tsx
+function UserList({ users }: { users: User[] }) {
+  // Mutates the users prop array!
+  const sorted = useMemo(
+    () => users.sort((a, b) => a.name.localeCompare(b.name)),
+    [users]
+  )
+  return <div>{sorted.map(renderUser)}</div>
+}
+```
+
+**Correct (creates new array):**
+
+```tsx
+function UserList({ users }: { users: User[] }) {
+  // Creates new sorted array, original unchanged
+  const sorted = useMemo(
+    () => users.toSorted((a, b) => a.name.localeCompare(b.name)),
+    [users]
+  )
+  return <div>{sorted.map(renderUser)}</div>
+}
+```
+
+**Why this matters in React:**
+
+1. Props/state mutations break React's immutability model - React expects props and state to be treated as read-only
+2. Causes stale closure bugs - Mutating arrays inside closures (callbacks, effects) can lead to unexpected behavior
+
+**Browser support (fallback for older browsers):**
+
+`.toSorted()` is available in all modern browsers (Chrome 110+, Safari 16+, Firefox 115+) and in Node.js 20+. For older environments, copy the array with the spread operator before sorting:
+
+```typescript
+// Fallback for older browsers
+const sorted = [...items].sort((a, b) => a.value - b.value)
+```
+
+**Other immutable array methods:**
+
+- `.toSorted()` - immutable sort
+- `.toReversed()` - immutable reverse
+- `.toSpliced()` - immutable splice
+- `.with()` - immutable element replacement
diff --git a/.github/skills/react-best-practices/rules/rendering-activity.md b/.github/skills/react-best-practices/rules/rendering-activity.md
new file mode 100644
index 0000000..c957a49
--- /dev/null
+++ b/.github/skills/react-best-practices/rules/rendering-activity.md
@@ -0,0 +1,26 @@
+---
+title: Use Activity Component for Show/Hide
+impact: MEDIUM
+impactDescription: preserves state/DOM
+tags: rendering, activity, visibility, state-preservation
+---
+
+## Use Activity Component for Show/Hide
+
+Use React's `<Activity>` to preserve state/DOM for expensive components that frequently toggle visibility.
+
+**Usage:**
+
+```tsx
+import { Activity } from 'react'
+
+function Dropdown({ isOpen }: Props) {
+  return (
+    <Activity mode={isOpen ? 'visible' : 'hidden'}>
+      <ExpensiveMenu />
+    </Activity>
+  )
+}
+```
+
+Avoids expensive re-renders and state loss.
diff --git a/.github/skills/react-best-practices/rules/rendering-animate-svg-wrapper.md b/.github/skills/react-best-practices/rules/rendering-animate-svg-wrapper.md
new file mode 100644
index 0000000..646744c
--- /dev/null
+++ b/.github/skills/react-best-practices/rules/rendering-animate-svg-wrapper.md
@@ -0,0 +1,47 @@
+---
+title: Animate SVG Wrapper Instead of SVG Element
+impact: LOW
+impactDescription: enables hardware acceleration
+tags: rendering, svg, css, animation, performance
+---
+
+## Animate SVG Wrapper Instead of SVG Element
+
+Many browsers don't have hardware acceleration for CSS3 animations on SVG elements. Wrap SVG in a `<div>` and animate the wrapper instead.
+
+**Incorrect (animating SVG directly - no hardware acceleration):**
+
+```tsx
+function LoadingSpinner() {
+  return (
+    <svg 
+      className="animate-spin"
+      width="24" 
+      height="24" 
+      viewBox="0 0 24 24"
+    >
+      <circle cx="12" cy="12" r="10" stroke="currentColor" />
+    </svg>
+  )
+}
+```
+
+**Correct (animating wrapper div - hardware accelerated):**
+
+```tsx
+function LoadingSpinner() {
+  return (
+    <div className="animate-spin">
+      <svg 
+        width="24" 
+        height="24" 
+        viewBox="0 0 24 24"
+      >
+        <circle cx="12" cy="12" r="10" stroke="currentColor" />
+      </svg>
+    </div>
+  )
+}
+```
+
+This applies to all CSS transforms and transitions (`transform`, `opacity`, `translate`, `scale`, `rotate`). The wrapper div allows browsers to use GPU acceleration for smoother animations.
diff --git a/.github/skills/react-best-practices/rules/rendering-conditional-render.md b/.github/skills/react-best-practices/rules/rendering-conditional-render.md
new file mode 100644
index 0000000..7e866f5
--- /dev/null
+++ b/.github/skills/react-best-practices/rules/rendering-conditional-render.md
@@ -0,0 +1,40 @@
+---
+title: Use Explicit Conditional Rendering
+impact: LOW
+impactDescription: prevents rendering 0 or NaN
+tags: rendering, conditional, jsx, falsy-values
+---
+
+## Use Explicit Conditional Rendering
+
+Use explicit ternary operators (`? :`) instead of `&&` for conditional rendering when the condition can be `0`, `NaN`, or other falsy values that render.
+
+**Incorrect (renders "0" when count is 0):**
+
+```tsx
+function Badge({ count }: { count: number }) {
+  return (
+    <div>
+      {count && <span className="badge">{count}</span>}
+    </div>
+  )
+}
+
+// When count = 0, renders: <div>0</div>
+// When count = 5, renders: <div><span class="badge">5</span></div>
+```
+
+**Correct (renders nothing when count is 0):**
+
+```tsx
+function Badge({ count }: { count: number }) {
+  return (
+    <div>
+      {count > 0 ? <span className="badge">{count}</span> : null}
+    </div>
+  )
+}
+
+// When count = 0, renders: <div></div>
+// When count = 5, renders: <div><span class="badge">5</span></div>
+```
diff --git a/.github/skills/react-best-practices/rules/rendering-content-visibility.md b/.github/skills/react-best-practices/rules/rendering-content-visibility.md
new file mode 100644
index 0000000..aa66563
--- /dev/null
+++ b/.github/skills/react-best-practices/rules/rendering-content-visibility.md
@@ -0,0 +1,38 @@
+---
+title: CSS content-visibility for Long Lists
+impact: HIGH
+impactDescription: faster initial render
+tags: rendering, css, content-visibility, long-lists
+---
+
+## CSS content-visibility for Long Lists
+
+Apply `content-visibility: auto` to defer off-screen rendering.
+
+**CSS:**
+
+```css
+.message-item {
+  content-visibility: auto;
+  contain-intrinsic-size: 0 80px;
+}
+```
+
+**Example:**
+
+```tsx
+function MessageList({ messages }: { messages: Message[] }) {
+  return (
+    <div className="overflow-y-auto h-screen">
+      {messages.map(msg => (
+        <div key={msg.id} className="message-item">
+          <Avatar user={msg.author} />
+          <div>{msg.content}</div>
+        </div>
+      ))}
+    </div>
+  )
+}
+```
+
+For 1000 messages, browser skips layout/paint for ~990 off-screen items (10× faster initial render).
diff --git a/.github/skills/react-best-practices/rules/rendering-hoist-jsx.md b/.github/skills/react-best-practices/rules/rendering-hoist-jsx.md
new file mode 100644
index 0000000..32d2f3f
--- /dev/null
+++ b/.github/skills/react-best-practices/rules/rendering-hoist-jsx.md
@@ -0,0 +1,46 @@
+---
+title: Hoist Static JSX Elements
+impact: LOW
+impactDescription: avoids re-creation
+tags: rendering, jsx, static, optimization
+---
+
+## Hoist Static JSX Elements
+
+Extract static JSX outside components to avoid re-creation.
+
+**Incorrect (recreates element every render):**
+
+```tsx
+function LoadingSkeleton() {
+  return <div className="animate-pulse h-20 bg-gray-200" />
+}
+
+function Container() {
+  return (
+    <div>
+      {loading && <LoadingSkeleton />}
+    </div>
+  )
+}
+```
+
+**Correct (reuses same element):**
+
+```tsx
+const loadingSkeleton = (
+  <div className="animate-pulse h-20 bg-gray-200" />
+)
+
+function Container() {
+  return (
+    <div>
+      {loading && loadingSkeleton}
+    </div>
+  )
+}
+```
+
+This is especially helpful for large and static SVG nodes, which can be expensive to recreate on every render.
+
+**Note:** If your project has [React Compiler](https://react.dev/learn/react-compiler) enabled, the compiler automatically hoists static JSX elements and optimizes component re-renders, making manual hoisting unnecessary.
diff --git a/.github/skills/react-best-practices/rules/rendering-hydration-no-flicker.md b/.github/skills/react-best-practices/rules/rendering-hydration-no-flicker.md
new file mode 100644
index 0000000..5cf0e79
--- /dev/null
+++ b/.github/skills/react-best-practices/rules/rendering-hydration-no-flicker.md
@@ -0,0 +1,82 @@
+---
+title: Prevent Hydration Mismatch Without Flickering
+impact: MEDIUM
+impactDescription: avoids visual flicker and hydration errors
+tags: rendering, ssr, hydration, localStorage, flicker
+---
+
+## Prevent Hydration Mismatch Without Flickering
+
+When rendering content that depends on client-side storage (localStorage, cookies), avoid both SSR breakage and post-hydration flickering by injecting a synchronous script that updates the DOM before React hydrates.
+
+**Incorrect (breaks SSR):**
+
+```tsx
+function ThemeWrapper({ children }: { children: ReactNode }) {
+  // localStorage is not available on server - throws error
+  const theme = localStorage.getItem('theme') || 'light'
+  
+  return (
+    <div className={theme}>
+      {children}
+    </div>
+  )
+}
+```
+
+Server-side rendering will fail because `localStorage` is undefined.
+
+**Incorrect (visual flickering):**
+
+```tsx
+function ThemeWrapper({ children }: { children: ReactNode }) {
+  const [theme, setTheme] = useState('light')
+  
+  useEffect(() => {
+    // Runs after hydration - causes visible flash
+    const stored = localStorage.getItem('theme')
+    if (stored) {
+      setTheme(stored)
+    }
+  }, [])
+  
+  return (
+    <div className={theme}>
+      {children}
+    </div>
+  )
+}
+```
+
+Component first renders with default value (`light`), then updates after hydration, causing a visible flash of incorrect content.
+
+**Correct (no flicker, no hydration mismatch):**
+
+```tsx
+function ThemeWrapper({ children }: { children: ReactNode }) {
+  return (
+    <>
+      <div id="theme-wrapper">
+        {children}
+      </div>
+      <script
+        dangerouslySetInnerHTML={{
+          __html: `
+            (function() {
+              try {
+                var theme = localStorage.getItem('theme') || 'light';
+                var el = document.getElementById('theme-wrapper');
+                if (el) el.className = theme;
+              } catch (e) {}
+            })();
+          `,
+        }}
+      />
+    </>
+  )
+}
+```
+
+The inline script executes synchronously, right after the element is parsed and before the browser paints it, so the DOM already has the correct value when the element first appears. No flickering, no hydration mismatch.
+
+This pattern is especially useful for theme toggles, user preferences, authentication states, and any client-only data that should render immediately without flashing default values.
diff --git a/.github/skills/react-best-practices/rules/rendering-hydration-suppress-warning.md b/.github/skills/react-best-practices/rules/rendering-hydration-suppress-warning.md
new file mode 100644
index 0000000..24ba251
--- /dev/null
+++ b/.github/skills/react-best-practices/rules/rendering-hydration-suppress-warning.md
@@ -0,0 +1,30 @@
+---
+title: Suppress Expected Hydration Mismatches
+impact: LOW-MEDIUM
+impactDescription: avoids noisy hydration warnings for known differences
+tags: rendering, hydration, ssr, nextjs
+---
+
+## Suppress Expected Hydration Mismatches
+
+In SSR frameworks (e.g., Next.js), some values are intentionally different on server vs client (random IDs, dates, locale/timezone formatting). For these *expected* mismatches, wrap the dynamic text in an element with `suppressHydrationWarning` to prevent noisy warnings. Do not use this to hide real bugs, and don't overuse it.
+
+**Incorrect (known mismatch warnings):**
+
+```tsx
+function Timestamp() {
+  return <span>{new Date().toLocaleString()}</span>
+}
+```
+
+**Correct (suppress expected mismatch only):**
+
+```tsx
+function Timestamp() {
+  return (
+    <span suppressHydrationWarning>
+      {new Date().toLocaleString()}
+    </span>
+  )
+}
+```
diff --git a/.github/skills/react-best-practices/rules/rendering-resource-hints.md b/.github/skills/react-best-practices/rules/rendering-resource-hints.md
new file mode 100644
index 0000000..1290bef
--- /dev/null
+++ b/.github/skills/react-best-practices/rules/rendering-resource-hints.md
@@ -0,0 +1,85 @@
+---
+title: Use React DOM Resource Hints
+impact: HIGH
+impactDescription: reduces load time for critical resources
+tags: rendering, preload, preconnect, prefetch, resource-hints
+---
+
+## Use React DOM Resource Hints
+
+**Impact: HIGH (reduces load time for critical resources)**
+
+React DOM provides APIs to hint the browser about resources it will need. These are especially useful in server components to start loading resources before the client even receives the HTML.
+
+- **`prefetchDNS(href)`**: Resolve DNS for a domain you expect to connect to
+- **`preconnect(href)`**: Establish connection (DNS + TCP + TLS) to a server
+- **`preload(href, options)`**: Fetch a resource (stylesheet, font, script, image) you'll use soon
+- **`preloadModule(href, options)`**: Fetch an ES module you'll use soon
+- **`preinit(href, options)`**: Fetch and evaluate a stylesheet or script
+- **`preinitModule(href, options)`**: Fetch and evaluate an ES module
+
+**Example (preconnect to third-party APIs):**
+
+```tsx
+import { preconnect, prefetchDNS } from 'react-dom'
+
+export default function App() {
+  prefetchDNS('https://analytics.example.com')
+  preconnect('https://api.example.com')
+
+  return <main>{/* content */}</main>
+}
+```
+
+**Example (preload critical fonts and styles):**
+
+```tsx
+import { preload, preinit } from 'react-dom'
+
+export default function RootLayout({ children }: { children: React.ReactNode }) {
+  // Preload font file
+  preload('/fonts/inter.woff2', { as: 'font', type: 'font/woff2', crossOrigin: 'anonymous' })
+
+  // Fetch and apply critical stylesheet immediately
+  preinit('/styles/critical.css', { as: 'style' })
+
+  return (
+    <html>
+      <body>{children}</body>
+    </html>
+  )
+}
+```
+
+**Example (preload modules for code-split routes):**
+
+```tsx
+import { preloadModule } from 'react-dom'
+
+function Navigation() {
+  const preloadDashboard = () => {
+    preloadModule('/dashboard.js', { as: 'script' })
+  }
+
+  return (
+    <nav>
+      <a href="/dashboard" onMouseEnter={preloadDashboard}>
+        Dashboard
+      </a>
+    </nav>
+  )
+}
+```
+
+**When to use each:**
+
+| API | Use case |
+|-----|----------|
+| `prefetchDNS` | Third-party domains you'll connect to later |
+| `preconnect` | APIs or CDNs you'll fetch from immediately |
+| `preload` | Critical resources needed for current page |
+| `preloadModule` | JS modules for likely next navigation |
+| `preinit` | Stylesheets/scripts that must execute early |
+| `preinitModule` | ES modules that must execute early |
+
+Reference: [React DOM Resource Preloading APIs](https://react.dev/reference/react-dom#resource-preloading-apis)
diff --git a/.github/skills/react-best-practices/rules/rendering-script-defer-async.md b/.github/skills/react-best-practices/rules/rendering-script-defer-async.md
new file mode 100644
index 0000000..ee275ed
--- /dev/null
+++ b/.github/skills/react-best-practices/rules/rendering-script-defer-async.md
@@ -0,0 +1,68 @@
+---
+title: Use defer or async on Script Tags
+impact: HIGH
+impactDescription: eliminates render-blocking
+tags: rendering, script, defer, async, performance
+---
+
+## Use defer or async on Script Tags
+
+**Impact: HIGH (eliminates render-blocking)**
+
+Script tags without `defer` or `async` block HTML parsing while the script downloads and executes. This delays First Contentful Paint and Time to Interactive.
+
+- **`defer`**: Downloads in parallel, executes after HTML parsing completes, maintains execution order
+- **`async`**: Downloads in parallel, executes immediately when ready, no guaranteed order
+
+Use `defer` for scripts that depend on DOM or other scripts. Use `async` for independent scripts like analytics.
+
+**Incorrect (blocks rendering):**
+
+```tsx
+export default function Document() {
+  return (
+    <html>
+      <head>
+        <script src="https://example.com/analytics.js" />
+        <script src="/scripts/utils.js" />
+      </head>
+      <body>{/* content */}</body>
+    </html>
+  )
+}
+```
+
+**Correct (non-blocking):**
+
+```tsx
+export default function Document() {
+  return (
+    <html>
+      <head>
+        {/* Independent script - use async */}
+        <script src="https://example.com/analytics.js" async />
+        {/* DOM-dependent script - use defer */}
+        <script src="/scripts/utils.js" defer />
+      </head>
+      <body>{/* content */}</body>
+    </html>
+  )
+}
+```
+
+**Note:** In Next.js, prefer the `next/script` component with `strategy` prop instead of raw script tags:
+
+```tsx
+import Script from 'next/script'
+
+export default function Page() {
+  return (
+    <>
+      <Script src="https://example.com/analytics.js" strategy="afterInteractive" />
+      <Script src="/scripts/utils.js" strategy="afterInteractive" />
+    </>
+  )
+}
+```
+
+Reference: [MDN - Script element](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/script#defer)
diff --git a/.github/skills/react-best-practices/rules/rendering-svg-precision.md b/.github/skills/react-best-practices/rules/rendering-svg-precision.md
new file mode 100644
index 0000000..6d77128
--- /dev/null
+++ b/.github/skills/react-best-practices/rules/rendering-svg-precision.md
@@ -0,0 +1,28 @@
+---
+title: Optimize SVG Precision
+impact: LOW
+impactDescription: reduces file size
+tags: rendering, svg, optimization, svgo
+---
+
+## Optimize SVG Precision
+
+Reduce SVG coordinate precision to decrease file size. The optimal precision depends on the viewBox size, but in general reducing precision should be considered.
+
+**Incorrect (excessive precision):**
+
+```svg
+<path d="M 10.293847 20.847362 L 30.938472 40.192837" />
+```
+
+**Correct (1 decimal place):**
+
+```svg
+<path d="M 10.3 20.8 L 30.9 40.2" />
+```
+
+**Automate with SVGO:**
+
+```bash
+npx svgo --precision=1 --multipass icon.svg
+```
diff --git a/.github/skills/react-best-practices/rules/rendering-usetransition-loading.md b/.github/skills/react-best-practices/rules/rendering-usetransition-loading.md
new file mode 100644
index 0000000..0c1b0b9
--- /dev/null
+++ b/.github/skills/react-best-practices/rules/rendering-usetransition-loading.md
@@ -0,0 +1,75 @@
+---
+title: Use useTransition Over Manual Loading States
+impact: LOW
+impactDescription: reduces re-renders and improves code clarity
+tags: rendering, transitions, useTransition, loading, state
+---
+
+## Use useTransition Over Manual Loading States
+
+Use `useTransition` instead of manual `useState` for loading states. This provides built-in `isPending` state and automatically manages transitions.
+
+**Incorrect (manual loading state):**
+
+```tsx
+function SearchResults() {
+  const [query, setQuery] = useState('')
+  const [results, setResults] = useState([])
+  const [isLoading, setIsLoading] = useState(false)
+
+  const handleSearch = async (value: string) => {
+    setIsLoading(true)
+    setQuery(value)
+    const data = await fetchResults(value)
+    setResults(data)
+    setIsLoading(false)
+  }
+
+  return (
+    <>
+      <input onChange={(e) => handleSearch(e.target.value)} />
+      {isLoading && <Spinner />}
+      <ResultsList results={results} />
+    </>
+  )
+}
+```
+
+**Correct (useTransition with built-in pending state):**
+
+```tsx
+import { useTransition, useState } from 'react'
+
+function SearchResults() {
+  const [query, setQuery] = useState('')
+  const [results, setResults] = useState([])
+  const [isPending, startTransition] = useTransition()
+
+  const handleSearch = (value: string) => {
+    setQuery(value) // Update input immediately
+    
+    startTransition(async () => {
+      const data = await fetchResults(value)
+      // State updates after an await must be wrapped again to stay in the Transition
+      startTransition(() => setResults(data))
+    })
+  }
+
+  return (
+    <>
+      <input onChange={(e) => handleSearch(e.target.value)} />
+      {isPending && <Spinner />}
+      <ResultsList results={results} />
+    </>
+  )
+}
+```
+
+**Benefits:**
+
+- **Automatic pending state**: No need to manually manage `setIsLoading(true/false)`
+- **Error resilience**: Pending state correctly resets even if the transition throws
+- **Better responsiveness**: Keeps the UI responsive during updates
+- **Interrupt handling**: A newer transition supersedes an unfinished render; in-flight requests are not cancelled, so guard against out-of-order responses
+
+Reference: [useTransition](https://react.dev/reference/react/useTransition)
diff --git a/.github/skills/react-best-practices/rules/rerender-defer-reads.md b/.github/skills/react-best-practices/rules/rerender-defer-reads.md
new file mode 100644
index 0000000..e867c95
--- /dev/null
+++ b/.github/skills/react-best-practices/rules/rerender-defer-reads.md
@@ -0,0 +1,39 @@
+---
+title: Defer State Reads to Usage Point
+impact: MEDIUM
+impactDescription: avoids unnecessary subscriptions
+tags: rerender, searchParams, localStorage, optimization
+---
+
+## Defer State Reads to Usage Point
+
+Don't subscribe to dynamic state (searchParams, localStorage) if you only read it inside callbacks.
+
+**Incorrect (subscribes to all searchParams changes):**
+
+```tsx
+function ShareButton({ chatId }: { chatId: string }) {
+  const searchParams = useSearchParams()
+
+  const handleShare = () => {
+    const ref = searchParams.get('ref')
+    shareChat(chatId, { ref })
+  }
+
+  return <button onClick={handleShare}>Share</button>
+}
+```
+
+**Correct (reads on demand, no subscription):**
+
+```tsx
+function ShareButton({ chatId }: { chatId: string }) {
+  const handleShare = () => {
+    const params = new URLSearchParams(window.location.search)
+    const ref = params.get('ref')
+    shareChat(chatId, { ref })
+  }
+
+  return <button onClick={handleShare}>Share</button>
+}
+```
diff --git a/.github/skills/react-best-practices/rules/rerender-dependencies.md b/.github/skills/react-best-practices/rules/rerender-dependencies.md
new file mode 100644
index 0000000..47a4d92
--- /dev/null
+++ b/.github/skills/react-best-practices/rules/rerender-dependencies.md
@@ -0,0 +1,45 @@
+---
+title: Narrow Effect Dependencies
+impact: LOW
+impactDescription: minimizes effect re-runs
+tags: rerender, useEffect, dependencies, optimization
+---
+
+## Narrow Effect Dependencies
+
+Specify primitive dependencies instead of objects to minimize effect re-runs.
+
+**Incorrect (re-runs on any user field change):**
+
+```tsx
+useEffect(() => {
+  console.log(user.id)
+}, [user])
+```
+
+**Correct (re-runs only when id changes):**
+
+```tsx
+useEffect(() => {
+  console.log(user.id)
+}, [user.id])
+```
+
+**For derived state, compute outside effect:**
+
+```tsx
+// Incorrect: runs on width=767, 766, 765...
+useEffect(() => {
+  if (width < 768) {
+    enableMobileMode()
+  }
+}, [width])
+
+// Correct: runs only on boolean transition
+const isMobile = width < 768
+useEffect(() => {
+  if (isMobile) {
+    enableMobileMode()
+  }
+}, [isMobile])
+```
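The comparison behind this rule can be sketched without React. The snippet below assumes dependencies are compared element by element with `Object.is`; `depsChanged` and the `User` shape are hypothetical names used only for illustration.

```typescript
// Hypothetical helper: compares two dependency arrays element by element.
function depsChanged(prev: readonly unknown[], next: readonly unknown[]): boolean {
  return prev.length !== next.length || prev.some((value, i) => !Object.is(value, next[i]))
}

type User = { id: string; name: string }

const original: User = { id: 'u1', name: 'Ada' }
const renamed: User = { ...original, name: 'Ada L.' } // new object, same id

// Depending on the object: any field change produces a new reference.
const objectDepChanged = depsChanged([original], [renamed])
// Depending on the primitive: unchanged as long as the id is the same.
const primitiveDepChanged = depsChanged([original.id], [renamed.id])
```

Here `objectDepChanged` is `true` and `primitiveDepChanged` is `false`, which is why `[user.id]` re-runs the effect less often than `[user]`.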
diff --git a/.github/skills/react-best-practices/rules/rerender-derived-state-no-effect.md b/.github/skills/react-best-practices/rules/rerender-derived-state-no-effect.md
new file mode 100644
index 0000000..3d9fe40
--- /dev/null
+++ b/.github/skills/react-best-practices/rules/rerender-derived-state-no-effect.md
@@ -0,0 +1,40 @@
+---
+title: Calculate Derived State During Rendering
+impact: MEDIUM
+impactDescription: avoids redundant renders and state drift
+tags: rerender, derived-state, useEffect, state
+---
+
+## Calculate Derived State During Rendering
+
+If a value can be computed from current props/state, do not store it in state or update it in an effect. Derive it during render to avoid extra renders and state drift. Do not set state in effects solely in response to prop changes; prefer derived values or keyed resets instead.
+
+**Incorrect (redundant state and effect):**
+
+```tsx
+function Form() {
+  const [firstName, setFirstName] = useState('First')
+  const [lastName, setLastName] = useState('Last')
+  const [fullName, setFullName] = useState('')
+
+  useEffect(() => {
+    setFullName(firstName + ' ' + lastName)
+  }, [firstName, lastName])
+
+  return <p>{fullName}</p>
+}
+```
+
+**Correct (derive during render):**
+
+```tsx
+function Form() {
+  const [firstName, setFirstName] = useState('First')
+  const [lastName, setLastName] = useState('Last')
+  const fullName = firstName + ' ' + lastName
+
+  return <p>{fullName}</p>
+}
+```
+
+Reference: [You Might Not Need an Effect](https://react.dev/learn/you-might-not-need-an-effect)
diff --git a/.github/skills/react-best-practices/rules/rerender-derived-state.md b/.github/skills/react-best-practices/rules/rerender-derived-state.md
new file mode 100644
index 0000000..e5c899f
--- /dev/null
+++ b/.github/skills/react-best-practices/rules/rerender-derived-state.md
@@ -0,0 +1,29 @@
+---
+title: Subscribe to Derived State
+impact: MEDIUM
+impactDescription: reduces re-render frequency
+tags: rerender, derived-state, media-query, optimization
+---
+
+## Subscribe to Derived State
+
+Subscribe to derived boolean state instead of continuous values to reduce re-render frequency.
+
+**Incorrect (re-renders on every pixel change):**
+
+```tsx
+function Sidebar() {
+  const width = useWindowWidth()  // updates continuously
+  const isMobile = width < 768
+  return <nav className={isMobile ? 'mobile' : 'desktop'} />
+}
+```
+
+**Correct (re-renders only when boolean changes):**
+
+```tsx
+function Sidebar() {
+  const isMobile = useMediaQuery('(max-width: 767px)')
+  return <nav className={isMobile ? 'mobile' : 'desktop'} />
+}
+```
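As a rough, framework-free sketch of the same idea, the source below notifies subscribers only when the derived boolean flips, not on every width change. `createMobileFlag` and its 768px default are assumptions for illustration; a real `useMediaQuery` hook would typically sit on top of `window.matchMedia`.

```typescript
type Listener = (isMobile: boolean) => void

// Hypothetical width source that stays silent unless the breakpoint is crossed.
function createMobileFlag(initialWidth: number, breakpoint = 768) {
  let isMobile = initialWidth < breakpoint
  const listeners = new Set<Listener>()
  return {
    subscribe(listener: Listener): () => void {
      listeners.add(listener)
      return () => { listeners.delete(listener) }
    },
    setWidth(width: number): void {
      const next = width < breakpoint
      if (next === isMobile) return // same side of the breakpoint: no notification
      isMobile = next
      listeners.forEach((listener) => listener(isMobile))
    },
  }
}

const flag = createMobileFlag(1024)
let notifications = 0
flag.subscribe(() => { notifications += 1 })
for (const width of [1000, 900, 800, 767, 766, 765]) flag.setWidth(width)
```

Six width changes produce a single notification (the 800 to 767 crossing), so a component subscribed to the flag would re-render once.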
diff --git a/.github/skills/react-best-practices/rules/rerender-functional-setstate.md b/.github/skills/react-best-practices/rules/rerender-functional-setstate.md
new file mode 100644
index 0000000..b004ef4
--- /dev/null
+++ b/.github/skills/react-best-practices/rules/rerender-functional-setstate.md
@@ -0,0 +1,74 @@
+---
+title: Use Functional setState Updates
+impact: MEDIUM
+impactDescription: prevents stale closures and unnecessary callback recreations
+tags: react, hooks, useState, useCallback, callbacks, closures
+---
+
+## Use Functional setState Updates
+
+When updating state based on the current state value, use the functional update form of setState instead of directly referencing the state variable. This prevents stale closures, eliminates unnecessary dependencies, and creates stable callback references.
+
+**Incorrect (requires state as dependency):**
+
+```tsx
+function TodoList() {
+  const [items, setItems] = useState(initialItems)
+  
+  // Callback must depend on items, recreated on every items change
+  const addItems = useCallback((newItems: Item[]) => {
+    setItems([...items, ...newItems])
+  }, [items])  // ❌ items dependency causes recreations
+  
+  // Risk of stale closure if dependency is forgotten
+  const removeItem = useCallback((id: string) => {
+    setItems(items.filter(item => item.id !== id))
+  }, [])  // ❌ Missing items dependency - will use stale items!
+  
+  return <ItemsEditor items={items} onAdd={addItems} onRemove={removeItem} />
+}
+```
+
+The first callback is recreated every time `items` changes, which can cause child components to re-render unnecessarily. The second callback has a stale closure bug—it will always reference the initial `items` value.
+
+**Correct (stable callbacks, no stale closures):**
+
+```tsx
+function TodoList() {
+  const [items, setItems] = useState(initialItems)
+  
+  // Stable callback, never recreated
+  const addItems = useCallback((newItems: Item[]) => {
+    setItems(curr => [...curr, ...newItems])
+  }, [])  // ✅ No dependencies needed
+  
+  // Always uses latest state, no stale closure risk
+  const removeItem = useCallback((id: string) => {
+    setItems(curr => curr.filter(item => item.id !== id))
+  }, [])  // ✅ Safe and stable
+  
+  return <ItemsEditor items={items} onAdd={addItems} onRemove={removeItem} />
+}
+```
+
+**Benefits:**
+
+1. **Stable callback references** - Callbacks don't need to be recreated when state changes
+2. **No stale closures** - Always operates on the latest state value
+3. **Fewer dependencies** - Simplifies dependency arrays and avoids closures that hold on to old state
+4. **Prevents bugs** - Eliminates the most common source of React closure bugs
+
+**When to use functional updates:**
+
+- Any setState that depends on the current state value
+- Inside useCallback/useMemo when state is needed
+- Event handlers that reference state
+- Async operations that update state
+
+**When direct updates are fine:**
+
+- Setting state to a static value: `setCount(0)`
+- Setting state from props/arguments only: `setName(newName)`
+- State doesn't depend on previous value
+
+**Note:** If your project has [React Compiler](https://react.dev/learn/react-compiler) enabled, the compiler can automatically optimize some cases, but functional updates are still recommended for correctness and to prevent stale closure bugs.
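The stale-closure problem can be reproduced without React. The sketch below uses a hypothetical `createState` cell (not React's implementation) whose setter accepts either a value or an updater function.

```typescript
// Hypothetical stand-in for a state cell; not React's implementation.
function createState<T>(initial: T) {
  let value = initial
  return {
    get: (): T => value,
    set(update: T | ((curr: T) => T)): void {
      value = typeof update === 'function' ? (update as (curr: T) => T)(value) : (update as T)
    },
  }
}

const items = createState<string[]>(['a'])

// Captures the array as it was when the closure was created.
const snapshot = items.get()
const addStale = (item: string) => items.set([...snapshot, item])
// Reads the current array at update time.
const addFresh = (item: string) => items.set((curr) => [...curr, item])

addStale('b')
addStale('c')
const staleResult = items.get() // 'b' is lost: both calls spread the old snapshot

items.set(['a'])
addFresh('b')
addFresh('c')
const freshResult = items.get() // both additions are kept
```

`staleResult` ends up as `['a', 'c']` while `freshResult` is `['a', 'b', 'c']`, the same difference a `useCallback` with a missing dependency would show.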
diff --git a/.github/skills/react-best-practices/rules/rerender-lazy-state-init.md b/.github/skills/react-best-practices/rules/rerender-lazy-state-init.md
new file mode 100644
index 0000000..4ecb350
--- /dev/null
+++ b/.github/skills/react-best-practices/rules/rerender-lazy-state-init.md
@@ -0,0 +1,58 @@
+---
+title: Use Lazy State Initialization
+impact: MEDIUM
+impactDescription: avoids wasted computation on every render
+tags: react, hooks, useState, performance, initialization
+---
+
+## Use Lazy State Initialization
+
+Pass a function to `useState` for expensive initial values. Without the function form, the initializer runs on every render even though the value is only used once.
+
+**Incorrect (runs on every render):**
+
+```tsx
+function FilteredList({ items }: { items: Item[] }) {
+  // buildSearchIndex() runs on EVERY render, even after initialization
+  const [searchIndex, setSearchIndex] = useState(buildSearchIndex(items))
+  const [query, setQuery] = useState('')
+  
+  // When query changes, buildSearchIndex runs again unnecessarily
+  return <SearchResults index={searchIndex} query={query} />
+}
+
+function UserProfile() {
+  // JSON.parse runs on every render
+  const [settings, setSettings] = useState(
+    JSON.parse(localStorage.getItem('settings') || '{}')
+  )
+  
+  return <SettingsForm settings={settings} onChange={setSettings} />
+}
+```
+
+**Correct (runs only once):**
+
+```tsx
+function FilteredList({ items }: { items: Item[] }) {
+  // buildSearchIndex() runs ONLY on initial render
+  const [searchIndex, setSearchIndex] = useState(() => buildSearchIndex(items))
+  const [query, setQuery] = useState('')
+  
+  return <SearchResults index={searchIndex} query={query} />
+}
+
+function UserProfile() {
+  // JSON.parse runs only on initial render
+  const [settings, setSettings] = useState(() => {
+    const stored = localStorage.getItem('settings')
+    return stored ? JSON.parse(stored) : {}
+  })
+  
+  return <SettingsForm settings={settings} onChange={setSettings} />
+}
+```
+
+Use lazy initialization when computing initial values from localStorage/sessionStorage, building data structures (indexes, maps), reading from the DOM, or performing heavy transformations.
+
+For simple primitives (`useState(0)`), direct references (`useState(props.value)`), or cheap literals (`useState({})`), the function form is unnecessary.
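The cost difference comes from ordinary JavaScript argument evaluation, which the following sketch illustrates. `createSlot` is a hypothetical stand-in for a hook that keeps only its first initial value; the loop stands in for repeated renders.

```typescript
let buildCount = 0
function buildIndex(): Map<string, number> {
  buildCount += 1
  return new Map([['a', 1]])
}

// Hypothetical stand-in for a hook that ignores the initial value after the first call.
function createSlot<T>() {
  let state: { value: T } | null = null
  return function use(initial: T | (() => T)): T {
    if (state === null) {
      state = { value: typeof initial === 'function' ? (initial as () => T)() : (initial as T) }
    }
    return state.value
  }
}

const eagerSlot = createSlot<Map<string, number>>()
for (let render = 0; render < 3; render++) eagerSlot(buildIndex())
const eagerBuilds = buildCount // the argument expression is evaluated on every call

buildCount = 0
const lazySlot = createSlot<Map<string, number>>()
for (let render = 0; render < 3; render++) lazySlot(() => buildIndex())
const lazyBuilds = buildCount // the initializer function runs only on the first call
```

Across three simulated renders, `eagerBuilds` is 3 and `lazyBuilds` is 1.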
diff --git a/.github/skills/react-best-practices/rules/rerender-memo-with-default-value.md b/.github/skills/react-best-practices/rules/rerender-memo-with-default-value.md
new file mode 100644
index 0000000..6357049
--- /dev/null
+++ b/.github/skills/react-best-practices/rules/rerender-memo-with-default-value.md
@@ -0,0 +1,38 @@
+---
+
+title: Extract Default Non-primitive Parameter Value from Memoized Component to Constant
+impact: MEDIUM
+impactDescription: restores memoization by using a constant for default value
+tags: rerender, memo, optimization
+
+---
+
+## Extract Default Non-primitive Parameter Value from Memoized Component to Constant
+
+When a memoized component has a default value for a non-primitive optional parameter, such as an array, function, or object, rendering the component without that parameter breaks memoization. A new default value instance is created on every rerender, and it fails the strict equality comparison in `memo()`.
+
+To address this issue, extract the default value into a constant.
+
+**Incorrect (`onClick` has different values on every rerender):**
+
+```tsx
+const UserAvatar = memo(function UserAvatar({ onClick = () => {} }: { onClick?: () => void }) {
+  // ...
+})
+
+// Used without optional onClick
+<UserAvatar />
+```
+
+**Correct (stable default value):**
+
+```tsx
+const NOOP = () => {};
+
+const UserAvatar = memo(function UserAvatar({ onClick = NOOP }: { onClick?: () => void }) {
+  // ...
+})
+
+// Used without optional onClick
+<UserAvatar />
+```
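The underlying behavior is plain JavaScript: a default parameter expression is re-evaluated on each call. A minimal sketch, with `inlineDefault` and `hoistedDefault` as illustrative names only:

```typescript
const NOOP = () => {}

// Default created inline: a new function object on every call.
function inlineDefault(onClick: () => void = () => {}) {
  return onClick
}

// Default hoisted to a module-level constant: the same function on every call.
function hoistedDefault(onClick: () => void = NOOP) {
  return onClick
}

const inlineIsStable = Object.is(inlineDefault(), inlineDefault())
const hoistedIsStable = Object.is(hoistedDefault(), hoistedDefault())
```

`inlineIsStable` is `false` and `hoistedIsStable` is `true`, which is the same identity check that decides whether `memo()` can skip a render.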
diff --git a/.github/skills/react-best-practices/rules/rerender-memo.md b/.github/skills/react-best-practices/rules/rerender-memo.md
new file mode 100644
index 0000000..f8982ab
--- /dev/null
+++ b/.github/skills/react-best-practices/rules/rerender-memo.md
@@ -0,0 +1,44 @@
+---
+title: Extract to Memoized Components
+impact: MEDIUM
+impactDescription: enables early returns
+tags: rerender, memo, useMemo, optimization
+---
+
+## Extract to Memoized Components
+
+Extract expensive work into memoized components to enable early returns before computation.
+
+**Incorrect (computes avatar even when loading):**
+
+```tsx
+function Profile({ user, loading }: Props) {
+  const avatar = useMemo(() => {
+    const id = computeAvatarId(user)
+    return <Avatar id={id} />
+  }, [user])
+
+  if (loading) return <Skeleton />
+  return <div>{avatar}</div>
+}
+```
+
+**Correct (skips computation when loading):**
+
+```tsx
+const UserAvatar = memo(function UserAvatar({ user }: { user: User }) {
+  const id = useMemo(() => computeAvatarId(user), [user])
+  return <Avatar id={id} />
+})
+
+function Profile({ user, loading }: Props) {
+  if (loading) return <Skeleton />
+  return (
+    <div>
+      <UserAvatar user={user} />
+    </div>
+  )
+}
+```
+
+**Note:** If your project has [React Compiler](https://react.dev/learn/react-compiler) enabled, manual memoization with `memo()` and `useMemo()` is not necessary. The compiler automatically optimizes re-renders.
diff --git a/.github/skills/react-best-practices/rules/rerender-move-effect-to-event.md b/.github/skills/react-best-practices/rules/rerender-move-effect-to-event.md
new file mode 100644
index 0000000..dd58a1a
--- /dev/null
+++ b/.github/skills/react-best-practices/rules/rerender-move-effect-to-event.md
@@ -0,0 +1,45 @@
+---
+title: Put Interaction Logic in Event Handlers
+impact: MEDIUM
+impactDescription: avoids effect re-runs and duplicate side effects
+tags: rerender, useEffect, events, side-effects, dependencies
+---
+
+## Put Interaction Logic in Event Handlers
+
+If a side effect is triggered by a specific user action (submit, click, drag), run it in that event handler. Do not model the action as state + effect; it makes effects re-run on unrelated changes and can duplicate the action.
+
+**Incorrect (event modeled as state + effect):**
+
+```tsx
+function Form() {
+  const [submitted, setSubmitted] = useState(false)
+  const theme = useContext(ThemeContext)
+
+  useEffect(() => {
+    if (submitted) {
+      post('/api/register')
+      showToast('Registered', theme)
+    }
+  }, [submitted, theme])
+
+  return <button onClick={() => setSubmitted(true)}>Submit</button>
+}
+```
+
+**Correct (do it in the handler):**
+
+```tsx
+function Form() {
+  const theme = useContext(ThemeContext)
+
+  function handleSubmit() {
+    post('/api/register')
+    showToast('Registered', theme)
+  }
+
+  return <button onClick={handleSubmit}>Submit</button>
+}
+```
+
+Reference: [Should this code move to an event handler?](https://react.dev/learn/removing-effect-dependencies#should-this-code-move-to-an-event-handler)
diff --git a/.github/skills/react-best-practices/rules/rerender-no-inline-components.md b/.github/skills/react-best-practices/rules/rerender-no-inline-components.md
new file mode 100644
index 0000000..d97592a
--- /dev/null
+++ b/.github/skills/react-best-practices/rules/rerender-no-inline-components.md
@@ -0,0 +1,82 @@
+---
+title: Don't Define Components Inside Components
+impact: HIGH
+impactDescription: prevents remount on every render
+tags: rerender, components, remount, performance
+---
+
+## Don't Define Components Inside Components
+
+**Impact: HIGH (prevents remount on every render)**
+
+Defining a component inside another component creates a new component type on every render. React sees a different component each time and fully remounts it, destroying all state and DOM.
+
+A common reason developers do this is to access parent variables without passing props. Always pass props instead.
+
+**Incorrect (remounts on every render):**
+
+```tsx
+function UserProfile({ user, theme }: { user: User; theme: string }) {
+  // Defined inside to access `theme` - BAD
+  const Avatar = () => (
+    <img
+      src={user.avatarUrl}
+      className={theme === 'dark' ? 'avatar-dark' : 'avatar-light'}
+    />
+  )
+
+  // Defined inside to access `user` - BAD
+  const Stats = () => (
+    <div>
+      <span>{user.followers} followers</span>
+      <span>{user.posts} posts</span>
+    </div>
+  )
+
+  return (
+    <div>
+      <Avatar />
+      <Stats />
+    </div>
+  )
+}
+```
+
+Every time `UserProfile` renders, `Avatar` and `Stats` are new component types. React unmounts the old instances and mounts new ones, losing any internal state, running effects again, and recreating DOM nodes.
+
+**Correct (pass props instead):**
+
+```tsx
+function Avatar({ src, theme }: { src: string; theme: string }) {
+  return (
+    <img
+      src={src}
+      className={theme === 'dark' ? 'avatar-dark' : 'avatar-light'}
+    />
+  )
+}
+
+function Stats({ followers, posts }: { followers: number; posts: number }) {
+  return (
+    <div>
+      <span>{followers} followers</span>
+      <span>{posts} posts</span>
+    </div>
+  )
+}
+
+function UserProfile({ user, theme }: { user: User; theme: string }) {
+  return (
+    <div>
+      <Avatar src={user.avatarUrl} theme={theme} />
+      <Stats followers={user.followers} posts={user.posts} />
+    </div>
+  )
+}
+```
+
+**Symptoms of this bug:**
+- Input fields lose focus on every keystroke
+- Animations restart unexpectedly
+- `useEffect` cleanup/setup runs on every parent render
+- Scroll position resets inside the component
diff --git a/.github/skills/react-best-practices/rules/rerender-simple-expression-in-memo.md b/.github/skills/react-best-practices/rules/rerender-simple-expression-in-memo.md
new file mode 100644
index 0000000..59dfab0
--- /dev/null
+++ b/.github/skills/react-best-practices/rules/rerender-simple-expression-in-memo.md
@@ -0,0 +1,35 @@
+---
+title: Do not wrap a simple expression with a primitive result type in useMemo
+impact: LOW-MEDIUM
+impactDescription: avoids memoization overhead on cheap expressions
+tags: rerender, useMemo, optimization
+---
+
+## Do not wrap a simple expression with a primitive result type in useMemo
+
+When an expression is simple (a few logical or arithmetic operators) and has a primitive result type (boolean, number, string), do not wrap it in `useMemo`.
+Calling `useMemo` and comparing hook dependencies may cost more than evaluating the expression itself.
+
+**Incorrect:**
+
+```tsx
+function Header({ user, notifications }: Props) {
+  const isLoading = useMemo(() => {
+    return user.isLoading || notifications.isLoading
+  }, [user.isLoading, notifications.isLoading])
+
+  if (isLoading) return <Skeleton />
+  // return some markup
+}
+```
+
+**Correct:**
+
+```tsx
+function Header({ user, notifications }: Props) {
+  const isLoading = user.isLoading || notifications.isLoading
+
+  if (isLoading) return <Skeleton />
+  // return some markup
+}
+```
diff --git a/.github/skills/react-best-practices/rules/rerender-transitions.md b/.github/skills/react-best-practices/rules/rerender-transitions.md
new file mode 100644
index 0000000..d99f43f
--- /dev/null
+++ b/.github/skills/react-best-practices/rules/rerender-transitions.md
@@ -0,0 +1,40 @@
+---
+title: Use Transitions for Non-Urgent Updates
+impact: MEDIUM
+impactDescription: maintains UI responsiveness
+tags: rerender, transitions, startTransition, performance
+---
+
+## Use Transitions for Non-Urgent Updates
+
+Mark frequent, non-urgent state updates as transitions to maintain UI responsiveness.
+
+**Incorrect (blocks UI on every scroll):**
+
+```tsx
+function ScrollTracker() {
+  const [scrollY, setScrollY] = useState(0)
+  useEffect(() => {
+    const handler = () => setScrollY(window.scrollY)
+    window.addEventListener('scroll', handler, { passive: true })
+    return () => window.removeEventListener('scroll', handler)
+  }, [])
+}
+```
+
+**Correct (non-blocking updates):**
+
+```tsx
+import { startTransition } from 'react'
+
+function ScrollTracker() {
+  const [scrollY, setScrollY] = useState(0)
+  useEffect(() => {
+    const handler = () => {
+      startTransition(() => setScrollY(window.scrollY))
+    }
+    window.addEventListener('scroll', handler, { passive: true })
+    return () => window.removeEventListener('scroll', handler)
+  }, [])
+}
+```
diff --git a/.github/skills/react-best-practices/rules/rerender-use-ref-transient-values.md b/.github/skills/react-best-practices/rules/rerender-use-ref-transient-values.md
new file mode 100644
index 0000000..cf04b81
--- /dev/null
+++ b/.github/skills/react-best-practices/rules/rerender-use-ref-transient-values.md
@@ -0,0 +1,73 @@
+---
+title: Use useRef for Transient Values
+impact: MEDIUM
+impactDescription: avoids unnecessary re-renders on frequent updates
+tags: rerender, useref, state, performance
+---
+
+## Use useRef for Transient Values
+
+When a value changes frequently and you don't want a re-render on every update (e.g., mouse trackers, intervals, transient flags), store it in `useRef` instead of `useState`. Keep component state for UI; use refs for temporary DOM-adjacent values. Updating a ref does not trigger a re-render.
+
+**Incorrect (renders every update):**
+
+```tsx
+function Tracker() {
+  const [lastX, setLastX] = useState(0)
+
+  useEffect(() => {
+    const onMove = (e: MouseEvent) => setLastX(e.clientX)
+    window.addEventListener('mousemove', onMove)
+    return () => window.removeEventListener('mousemove', onMove)
+  }, [])
+
+  return (
+    <div
+      style={{
+        position: 'fixed',
+        top: 0,
+        left: lastX,
+        width: 8,
+        height: 8,
+        background: 'black',
+      }}
+    />
+  )
+}
+```
+
+**Correct (no re-render for tracking):**
+
+```tsx
+function Tracker() {
+  const lastXRef = useRef(0)
+  const dotRef = useRef<HTMLDivElement>(null)
+
+  useEffect(() => {
+    const onMove = (e: MouseEvent) => {
+      lastXRef.current = e.clientX
+      const node = dotRef.current
+      if (node) {
+        node.style.transform = `translateX(${e.clientX}px)`
+      }
+    }
+    window.addEventListener('mousemove', onMove)
+    return () => window.removeEventListener('mousemove', onMove)
+  }, [])
+
+  return (
+    <div
+      ref={dotRef}
+      style={{
+        position: 'fixed',
+        top: 0,
+        left: 0,
+        width: 8,
+        height: 8,
+        background: 'black',
+        transform: 'translateX(0px)',
+      }}
+    />
+  )
+}
+```
diff --git a/.github/skills/react-best-practices/rules/server-after-nonblocking.md b/.github/skills/react-best-practices/rules/server-after-nonblocking.md
new file mode 100644
index 0000000..e8f5b26
--- /dev/null
+++ b/.github/skills/react-best-practices/rules/server-after-nonblocking.md
@@ -0,0 +1,73 @@
+---
+title: Use after() for Non-Blocking Operations
+impact: MEDIUM
+impactDescription: faster response times
+tags: server, async, logging, analytics, side-effects
+---
+
+## Use after() for Non-Blocking Operations
+
+Use Next.js's `after()` to schedule work that should execute after a response is sent. This prevents logging, analytics, and other side effects from blocking the response.
+
+**Incorrect (blocks response):**
+
+```tsx
+import { logUserAction } from '@/app/utils'
+
+export async function POST(request: Request) {
+  // Perform mutation
+  await updateDatabase(request)
+  
+  // Logging blocks the response
+  const userAgent = request.headers.get('user-agent') || 'unknown'
+  await logUserAction({ userAgent })
+  
+  return new Response(JSON.stringify({ status: 'success' }), {
+    status: 200,
+    headers: { 'Content-Type': 'application/json' }
+  })
+}
+```
+
+**Correct (non-blocking):**
+
+```tsx
+import { after } from 'next/server'
+import { headers, cookies } from 'next/headers'
+import { logUserAction } from '@/app/utils'
+
+export async function POST(request: Request) {
+  // Perform mutation
+  await updateDatabase(request)
+  
+  // Log after response is sent
+  after(async () => {
+    const userAgent = (await headers()).get('user-agent') || 'unknown'
+    const sessionCookie = (await cookies()).get('session-id')?.value || 'anonymous'
+    
+    await logUserAction({ sessionCookie, userAgent })
+  })
+  
+  return new Response(JSON.stringify({ status: 'success' }), {
+    status: 200,
+    headers: { 'Content-Type': 'application/json' }
+  })
+}
+```
+
+The response is sent immediately while logging happens in the background.
+
+**Common use cases:**
+
+- Analytics tracking
+- Audit logging
+- Sending notifications
+- Cache invalidation
+- Cleanup tasks
+
+**Important notes:**
+
+- `after()` runs even if the response fails or redirects
+- Works in Server Actions, Route Handlers, and Server Components
+
+Reference: [https://nextjs.org/docs/app/api-reference/functions/after](https://nextjs.org/docs/app/api-reference/functions/after)
diff --git a/.github/skills/react-best-practices/rules/server-auth-actions.md b/.github/skills/react-best-practices/rules/server-auth-actions.md
new file mode 100644
index 0000000..ee82c04
--- /dev/null
+++ b/.github/skills/react-best-practices/rules/server-auth-actions.md
@@ -0,0 +1,96 @@
+---
+title: Authenticate Server Actions Like API Routes
+impact: CRITICAL
+impactDescription: prevents unauthorized access to server mutations
+tags: server, server-actions, authentication, security, authorization
+---
+
+## Authenticate Server Actions Like API Routes
+
+**Impact: CRITICAL (prevents unauthorized access to server mutations)**
+
+Server Actions (functions with `"use server"`) are exposed as public endpoints, just like API routes. Always verify authentication and authorization **inside** each Server Action—do not rely solely on middleware, layout guards, or page-level checks, as Server Actions can be invoked directly.
+
+Next.js documentation explicitly states: "Treat Server Actions with the same security considerations as public-facing API endpoints, and verify if the user is allowed to perform a mutation."
+
+**Incorrect (no authentication check):**
+
+```typescript
+'use server'
+
+export async function deleteUser(userId: string) {
+  // Anyone can call this! No auth check
+  await db.user.delete({ where: { id: userId } })
+  return { success: true }
+}
+```
+
+**Correct (authentication inside the action):**
+
+```typescript
+'use server'
+
+import { verifySession } from '@/lib/auth'
+import { unauthorized } from '@/lib/errors'
+
+export async function deleteUser(userId: string) {
+  // Always check auth inside the action
+  const session = await verifySession()
+  
+  if (!session) {
+    throw unauthorized('Must be logged in')
+  }
+  
+  // Check authorization too
+  if (session.user.role !== 'admin' && session.user.id !== userId) {
+    throw unauthorized('Cannot delete other users')
+  }
+  
+  await db.user.delete({ where: { id: userId } })
+  return { success: true }
+}
+```
+
+**With input validation:**
+
+```typescript
+'use server'
+
+import { verifySession } from '@/lib/auth'
+import { z } from 'zod'
+
+const updateProfileSchema = z.object({
+  userId: z.string().uuid(),
+  name: z.string().min(1).max(100),
+  email: z.string().email()
+})
+
+export async function updateProfile(data: unknown) {
+  // Validate input first
+  const validated = updateProfileSchema.parse(data)
+  
+  // Then authenticate
+  const session = await verifySession()
+  if (!session) {
+    throw new Error('Unauthorized')
+  }
+  
+  // Then authorize
+  if (session.user.id !== validated.userId) {
+    throw new Error('Can only update own profile')
+  }
+  
+  // Finally perform the mutation
+  await db.user.update({
+    where: { id: validated.userId },
+    data: {
+      name: validated.name,
+      email: validated.email
+    }
+  })
+  
+  return { success: true }
+}
+```
+
+Reference: [https://nextjs.org/docs/app/guides/authentication](https://nextjs.org/docs/app/guides/authentication)
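One way to keep the authorization rule easy to unit test is to isolate it in a pure function that the Server Action calls after loading the session. This is only a sketch: the `Session` shape and `denyDeleteReason` are hypothetical and do not come from Next.js or any auth library.

```typescript
type Session = { user: { id: string; role: 'admin' | 'member' } }

// Hypothetical pure helper: returns the reason for denial, or null when allowed.
function denyDeleteReason(session: Session | null, targetUserId: string): string | null {
  if (!session) return 'Must be logged in'
  const isAdmin = session.user.role === 'admin'
  const isSelf = session.user.id === targetUserId
  return isAdmin || isSelf ? null : 'Cannot delete other users'
}

const member: Session = { user: { id: 'u1', role: 'member' } }
const admin: Session = { user: { id: 'u9', role: 'admin' } }

const anonymousAttempt = denyDeleteReason(null, 'u1')
const memberDeletesOther = denyDeleteReason(member, 'u2')
const memberDeletesSelf = denyDeleteReason(member, 'u1')
const adminDeletesOther = denyDeleteReason(admin, 'u2')
```

The action itself would still call the session check on every invocation; the helper only separates the decision from the database call.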
diff --git a/.github/skills/react-best-practices/rules/server-cache-lru.md b/.github/skills/react-best-practices/rules/server-cache-lru.md
new file mode 100644
index 0000000..ef6938a
--- /dev/null
+++ b/.github/skills/react-best-practices/rules/server-cache-lru.md
@@ -0,0 +1,41 @@
+---
+title: Cross-Request LRU Caching
+impact: HIGH
+impactDescription: caches across requests
+tags: server, cache, lru, cross-request
+---
+
+## Cross-Request LRU Caching
+
+`React.cache()` only works within one request. For data shared across sequential requests (user clicks button A then button B), use an LRU cache.
+
+**Implementation:**
+
+```typescript
+import { LRUCache } from 'lru-cache'
+
+const cache = new LRUCache<string, any>({
+  max: 1000,
+  ttl: 5 * 60 * 1000  // 5 minutes
+})
+
+export async function getUser(id: string) {
+  const cached = cache.get(id)
+  if (cached) return cached
+
+  const user = await db.user.findUnique({ where: { id } })
+  cache.set(id, user)
+  return user
+}
+
+// Request 1: DB query, result cached
+// Request 2: cache hit, no DB query
+```
+
+Use when sequential user actions hit multiple endpoints needing the same data within seconds.
+
+**With Vercel's [Fluid Compute](https://vercel.com/docs/fluid-compute):** LRU caching is especially effective because multiple concurrent requests can share the same function instance and cache. This means the cache persists across requests without needing external storage like Redis.
+
+**In traditional serverless:** Each invocation runs in isolation, so consider Redis for cross-process caching.
+
+Reference: [https://github.com/isaacs/node-lru-cache](https://github.com/isaacs/node-lru-cache)
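To show what "least recently used" eviction means, here is a minimal sketch built on `Map` insertion order. `TinyLRU` is a hypothetical teaching class, not the `lru-cache` package, and it omits TTL handling entirely.

```typescript
// Minimal LRU built on Map insertion order; a sketch, not the lru-cache package.
class TinyLRU<K, V> {
  private readonly entries = new Map<K, V>()
  private readonly max: number

  constructor(max: number) {
    this.max = max
  }

  get(key: K): V | undefined {
    if (!this.entries.has(key)) return undefined
    const value = this.entries.get(key) as V
    // Re-insert so the key becomes the most recently used.
    this.entries.delete(key)
    this.entries.set(key, value)
    return value
  }

  set(key: K, value: V): void {
    this.entries.delete(key)
    this.entries.set(key, value)
    if (this.entries.size > this.max) {
      // The first key in iteration order is the least recently used.
      const oldest = this.entries.keys().next().value as K
      this.entries.delete(oldest)
    }
  }

  has(key: K): boolean {
    return this.entries.has(key)
  }
}

const users = new TinyLRU<string, string>(2)
users.set('u1', 'Ada')
users.set('u2', 'Grace')
users.get('u1')          // touch u1, so u2 becomes the oldest entry
users.set('u3', 'Linus') // exceeds max, so u2 is evicted
```

After these calls the cache holds `u1` and `u3`; `u2` was evicted because it was read least recently.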
diff --git a/.github/skills/react-best-practices/rules/server-cache-react.md b/.github/skills/react-best-practices/rules/server-cache-react.md
new file mode 100644
index 0000000..87c9ca3
--- /dev/null
+++ b/.github/skills/react-best-practices/rules/server-cache-react.md
@@ -0,0 +1,76 @@
+---
+title: Per-Request Deduplication with React.cache()
+impact: MEDIUM
+impactDescription: deduplicates within request
+tags: server, cache, react-cache, deduplication
+---
+
+## Per-Request Deduplication with React.cache()
+
+Use `React.cache()` for server-side request deduplication. Authentication and database queries benefit most.
+
+**Usage:**
+
+```typescript
+import { cache } from 'react'
+
+export const getCurrentUser = cache(async () => {
+  const session = await auth()
+  if (!session?.user?.id) return null
+  return await db.user.findUnique({
+    where: { id: session.user.id }
+  })
+})
+```
+
+Within a single request, multiple calls to `getCurrentUser()` execute the query only once.
+
+**Avoid inline objects as arguments:**
+
+`React.cache()` compares each argument with `Object.is` to determine cache hits, so objects match by reference, not by content. Inline objects create a new reference on each call, preventing cache hits.
+
+**Incorrect (always cache miss):**
+
+```typescript
+const getUser = cache(async (params: { uid: number }) => {
+  return await db.user.findUnique({ where: { id: params.uid } })
+})
+
+// Each call creates new object, never hits cache
+getUser({ uid: 1 })
+getUser({ uid: 1 })  // Cache miss, runs query again
+```
+
+**Correct (cache hit):**
+
+```typescript
+const getUser = cache(async (uid: number) => {
+  return await db.user.findUnique({ where: { id: uid } })
+})
+
+// Primitive args use value equality
+getUser(1)
+getUser(1)  // Cache hit, returns cached result
+```
+
+If you must pass objects, pass the same reference:
+
+```typescript
+const params = { uid: 1 }
+getUser(params)  // Query runs
+getUser(params)  // Cache hit (same reference)
+```
+
+**Next.js-Specific Note:**
+
+In Next.js, the `fetch` API is automatically extended with request memoization. Requests with the same URL and options are automatically deduplicated within a single request, so you don't need `React.cache()` for `fetch` calls. However, `React.cache()` is still essential for other async tasks:
+
+- Database queries (Prisma, Drizzle, etc.)
+- Heavy computations
+- Authentication checks
+- File system operations
+- Any non-fetch async work
+
+Use `React.cache()` to deduplicate these operations across your component tree.
+
+Reference: [React.cache documentation](https://react.dev/reference/react/cache)
diff --git a/.github/skills/react-best-practices/rules/server-dedup-props.md b/.github/skills/react-best-practices/rules/server-dedup-props.md
new file mode 100644
index 0000000..fb24a25
--- /dev/null
+++ b/.github/skills/react-best-practices/rules/server-dedup-props.md
@@ -0,0 +1,65 @@
+---
+title: Avoid Duplicate Serialization in RSC Props
+impact: LOW
+impactDescription: reduces network payload by avoiding duplicate serialization
+tags: server, rsc, serialization, props, client-components
+---
+
+## Avoid Duplicate Serialization in RSC Props
+
+**Impact: LOW (reduces network payload by avoiding duplicate serialization)**
+
+RSC→client serialization deduplicates by object reference, not by value: the same reference is serialized once, while a new reference is serialized again. Do transformations (`.toSorted()`, `.filter()`, `.map()`) on the client, not the server.
+
+**Incorrect (duplicates array):**
+
+```tsx
+// RSC: sends 6 strings (2 arrays × 3 items)
+<ClientList usernames={usernames} usernamesOrdered={usernames.toSorted()} />
+```
+
+**Correct (sends 3 strings):**
+
+```tsx
+// RSC: send once
+<ClientList usernames={usernames} />
+
+// Client: transform there
+'use client'
+const sorted = useMemo(() => [...usernames].sort(), [usernames])
+```
+
+**Nested deduplication behavior:**
+
+Deduplication works recursively. Impact varies by data type:
+
+- `string[]`, `number[]`, `boolean[]`: **HIGH impact** - array + all primitives fully duplicated
+- `object[]`: **LOW impact** - array duplicated, but nested objects deduplicated by reference
+
+```tsx
+// string[] - duplicates everything
+usernames={['a','b']} sorted={usernames.toSorted()} // sends 4 strings
+
+// object[] - duplicates array structure only
+users={[{id:1},{id:2}]} sorted={users.toSorted()} // sends 2 arrays + 2 unique objects (not 4)
+```
+
+**Operations breaking deduplication (create new references):**
+
+- Arrays: `.toSorted()`, `.filter()`, `.map()`, `.slice()`, `[...arr]`
+- Objects: `{...obj}`, `Object.assign()`, `structuredClone()`, `JSON.parse(JSON.stringify())`
+
+**More examples:**
+
+```tsx
+// ❌ Bad
+<C users={users} active={users.filter(u => u.active)} />
+<C product={product} productName={product.name} />
+
+// ✅ Good
+<C users={users} />
+<C product={product} />
+// Do filtering/destructuring in client
+```
+
+**Exception:** Pass derived data when the transformation is expensive or the client doesn't need the original.
diff --git a/.github/skills/react-best-practices/rules/server-hoist-static-io.md b/.github/skills/react-best-practices/rules/server-hoist-static-io.md
new file mode 100644
index 0000000..5b642b6
--- /dev/null
+++ b/.github/skills/react-best-practices/rules/server-hoist-static-io.md
@@ -0,0 +1,142 @@
+---
+title: Hoist Static I/O to Module Level
+impact: HIGH
+impactDescription: avoids repeated file/network I/O per request
+tags: server, io, performance, next.js, route-handlers, og-image
+---
+
+## Hoist Static I/O to Module Level
+
+**Impact: HIGH (avoids repeated file/network I/O per request)**
+
+When loading static assets (fonts, logos, images, config files) in route handlers or server functions, hoist the I/O operation to module level. Module-level code runs once when the module is first imported, not on every request. This eliminates redundant file system reads or network fetches that would otherwise run on every invocation.
+
+**Incorrect: reads font file on every request**
+
+```typescript
+// app/api/og/route.tsx
+import { ImageResponse } from 'next/og'
+
+export async function GET(request: Request) {
+  // Runs on EVERY request - expensive!
+  const fontData = await fetch(
+    new URL('./fonts/Inter.ttf', import.meta.url)
+  ).then(res => res.arrayBuffer())
+  
+  const logoData = await fetch(
+    new URL('./images/logo.png', import.meta.url)
+  ).then(res => res.arrayBuffer())
+
+  return new ImageResponse(
+    <div style={{ fontFamily: 'Inter' }}>
+      <img src={logoData} />
+      Hello World
+    </div>,
+    { fonts: [{ name: 'Inter', data: fontData }] }
+  )
+}
+```
+
+**Correct: loads once at module initialization**
+
+```typescript
+// app/api/og/route.tsx
+import { ImageResponse } from 'next/og'
+
+// Module-level: runs ONCE when module is first imported
+const fontData = fetch(
+  new URL('./fonts/Inter.ttf', import.meta.url)
+).then(res => res.arrayBuffer())
+
+const logoData = fetch(
+  new URL('./images/logo.png', import.meta.url)
+).then(res => res.arrayBuffer())
+
+export async function GET(request: Request) {
+  // Await the already-started promises
+  const [font, logo] = await Promise.all([fontData, logoData])
+
+  return new ImageResponse(
+    <div style={{ fontFamily: 'Inter' }}>
+      <img src={logo} />
+      Hello World
+    </div>,
+    { fonts: [{ name: 'Inter', data: font }] }
+  )
+}
+```
+
+**Alternative: synchronous file reads with Node.js fs**
+
+```typescript
+// app/api/og/route.tsx
+import { ImageResponse } from 'next/og'
+import { readFileSync } from 'fs'
+import { join } from 'path'
+
+// Synchronous read at module level - blocks only during module init
+const fontData = readFileSync(
+  join(process.cwd(), 'public/fonts/Inter.ttf')
+)
+
+const logoData = readFileSync(
+  join(process.cwd(), 'public/images/logo.png')
+)
+
+export async function GET(request: Request) {
+  return new ImageResponse(
+    <div style={{ fontFamily: 'Inter' }}>
+      <img src={logoData} />
+      Hello World
+    </div>,
+    { fonts: [{ name: 'Inter', data: fontData }] }
+  )
+}
+```
+
+**General Node.js example: loading config or templates**
+
+```typescript
+// Incorrect: reads config on every call
+export async function processRequest(data: Data) {
+  const config = JSON.parse(
+    await fs.readFile('./config.json', 'utf-8')
+  )
+  const template = await fs.readFile('./template.html', 'utf-8')
+  
+  return render(template, data, config)
+}
+
+// Correct: loads once at module level
+const configPromise = fs.readFile('./config.json', 'utf-8')
+  .then(JSON.parse)
+const templatePromise = fs.readFile('./template.html', 'utf-8')
+
+export async function processRequest(data: Data) {
+  const [config, template] = await Promise.all([
+    configPromise,
+    templatePromise
+  ])
+  
+  return render(template, data, config)
+}
+```
+
+**When to use this pattern:**
+
+- Loading fonts for OG image generation
+- Loading static logos, icons, or watermarks
+- Reading configuration files that don't change at runtime
+- Loading email templates or other static templates
+- Any static asset that's the same across all requests
+
+**When NOT to use this pattern:**
+
+- Assets that vary per request or user
+- Files that may change during runtime (use caching with TTL instead)
+- Large files that would consume too much memory if kept loaded
+- Sensitive data that shouldn't persist in memory
+
+**With Vercel's [Fluid Compute](https://vercel.com/docs/fluid-compute):** Module-level caching is especially effective because multiple concurrent requests share the same function instance. The static assets stay loaded in memory across requests without cold start penalties.
+
+**In traditional serverless:** Each cold start re-executes module-level code, but subsequent warm invocations reuse the loaded assets until the instance is recycled.
diff --git a/.github/skills/react-best-practices/rules/server-parallel-fetching.md b/.github/skills/react-best-practices/rules/server-parallel-fetching.md
new file mode 100644
index 0000000..1affc83
--- /dev/null
+++ b/.github/skills/react-best-practices/rules/server-parallel-fetching.md
@@ -0,0 +1,83 @@
+---
+title: Parallel Data Fetching with Component Composition
+impact: CRITICAL
+impactDescription: eliminates server-side waterfalls
+tags: server, rsc, parallel-fetching, composition
+---
+
+## Parallel Data Fetching with Component Composition
+
+An async Server Component blocks the rendering of its children until its own `await`s resolve, so nested fetches run as a waterfall. Restructure with composition so that sibling components fetch in parallel.
+
+**Incorrect (Sidebar waits for Page's fetch to complete):**
+
+```tsx
+export default async function Page() {
+  const header = await fetchHeader()
+  return (
+    <div>
+      <div>{header}</div>
+      <Sidebar />
+    </div>
+  )
+}
+
+async function Sidebar() {
+  const items = await fetchSidebarItems()
+  return <nav>{items.map(renderItem)}</nav>
+}
+```
+
+**Correct (both fetch simultaneously):**
+
+```tsx
+async function Header() {
+  const data = await fetchHeader()
+  return <div>{data}</div>
+}
+
+async function Sidebar() {
+  const items = await fetchSidebarItems()
+  return <nav>{items.map(renderItem)}</nav>
+}
+
+export default function Page() {
+  return (
+    <div>
+      <Header />
+      <Sidebar />
+    </div>
+  )
+}
+```
+
+**Alternative with children prop:**
+
+```tsx
+async function Header() {
+  const data = await fetchHeader()
+  return <div>{data}</div>
+}
+
+async function Sidebar() {
+  const items = await fetchSidebarItems()
+  return <nav>{items.map(renderItem)}</nav>
+}
+
+function Layout({ children }: { children: ReactNode }) {
+  return (
+    <div>
+      <Header />
+      {children}
+    </div>
+  )
+}
+
+export default function Page() {
+  return (
+    <Layout>
+      <Sidebar />
+    </Layout>
+  )
+}
+```
diff --git a/.github/skills/react-best-practices/rules/server-serialization.md b/.github/skills/react-best-practices/rules/server-serialization.md
new file mode 100644
index 0000000..39c5c41
--- /dev/null
+++ b/.github/skills/react-best-practices/rules/server-serialization.md
@@ -0,0 +1,38 @@
+---
+title: Minimize Serialization at RSC Boundaries
+impact: HIGH
+impactDescription: reduces data transfer size
+tags: server, rsc, serialization, props
+---
+
+## Minimize Serialization at RSC Boundaries
+
+The React Server/Client boundary serializes all object properties into strings and embeds them in the HTML response and subsequent RSC requests. This serialized data directly impacts page weight and load time, so **size matters a lot**. Only pass fields that the client actually uses.
+
+**Incorrect (serializes all 50 fields):**
+
+```tsx
+async function Page() {
+  const user = await fetchUser()  // 50 fields
+  return <Profile user={user} />
+}
+
+'use client'
+function Profile({ user }: { user: User }) {
+  return <div>{user.name}</div>  // uses 1 field
+}
+```
+
+**Correct (serializes only 1 field):**
+
+```tsx
+async function Page() {
+  const user = await fetchUser()
+  return <Profile name={user.name} />
+}
+
+'use client'
+function Profile({ name }: { name: string }) {
+  return <div>{name}</div>
+}
+```
diff --git a/.github/workflows/gtc-ci.yml b/.github/workflows/gtc-ci.yml
index cbbc11f..c518fe4 100644
--- a/.github/workflows/gtc-ci.yml
+++ b/.github/workflows/gtc-ci.yml
@@ -72,15 +72,20 @@ jobs:
           restore-keys: |
             uv-${{ runner.os }}-py311-
 
-      - name: Install backend dependencies (dev)
-        working-directory: backend
-        run: uv sync --frozen
+      - name: Setup Node.js 20
+        uses: actions/setup-node@v4
+        with:
+          node-version: "20"
+          cache: "npm"
+          cache-dependency-path: frontend/package-lock.json
+
+      - name: Install dependencies through harness
+        run: make -f Makefile.harness setup
 
-      - name: Backend unit tests
-        working-directory: backend
+      - name: Run app CI through harness
         env:
-          GTC_LLM_ENABLED: "False"
-        run: uv run pytest -q tests/unit -v --junitxml=pytest-unit-results.xml --cov=backend/src --cov-report=xml --cov-report=term 
+          HARNESS_TY_OUTPUT_FORMAT: github
+        run: make -f Makefile.harness ci
 
       - name: Publish test results
         uses: EnricoMi/publish-unit-test-result-action@v2
@@ -90,32 +95,18 @@ jobs:
           check_name: "Backend Unit Test Results"
           comment_title: "GTC Backend Unit Test Results"
 
-      - name: Wait for Cosmos DB Emulator to be ready
-        run: |
-          for i in {1..60}; do
-            if curl -ksSf https://localhost:8081/_explorer/emulator.pem >/dev/null 2>&1 || curl -sS --max-time 2 http://localhost:8081/ >/dev/null 2>&1; then
-              echo "Emulator is ready"
-              break
-            fi
-            if [ "$i" -eq 60 ]; then
-              echo "Emulator not ready in time" >&2
-              # Dump container logs to help diagnose
-              docker logs $(docker ps -aqf "name=cosmos") || true
-              exit 1
-            fi
-            sleep 2
-          done
-
-      - name: Load integration test environment
-        run: |
-          grep -v '^[[:space:]]*#' backend/environments/integration-tests.env | sed 's/\r$//' >> "$GITHUB_ENV"
+      - name: Publish frontend test results
+        uses: EnricoMi/publish-unit-test-result-action@v2
+        if: always()
+        with:
+          files: frontend/junit-results.xml
+          check_name: "Frontend Test Results"
+          comment_title: "GTC Frontend Test Results"
 
-      - name: Backend integration tests (Cosmos)
-        working-directory: backend
+      - name: Backend integration tests (Cosmos) through harness
         env:
-          GTC_LLM_ENABLED: "False"
-        run: |
-          uv run pytest -q tests/integration -v --junitxml=pytest-int-results.xml --cov=backend/src --cov-report=xml --cov-report=term
+          HARNESS_COSMOS_READY_URL: http://localhost:8081
+        run: make -f Makefile.harness backend-integration-test
 
       - name: Publish test results
         uses: EnricoMi/publish-unit-test-result-action@v2
@@ -125,47 +116,6 @@ jobs:
           check_name: "Backend Integration Test Results"
           comment_title: "GTC Backend Integration Test Results"
 
-      - name: ty (backend/app)
-        working-directory: backend
-        run: uv run ty check app --output-format github --exclude app/adapters/inference/inference.py --force-exclude
-
-      - name: Setup Node.js 20
-        uses: actions/setup-node@v4
-        with:
-          node-version: "20"
-          cache: "npm"
-          cache-dependency-path: frontend/package-lock.json
-
-      - name: Install frontend dependencies
-        working-directory: frontend
-        run: npm ci --no-audit --no-fund
-
-      - name: Export OpenAPI spec
-        working-directory: backend
-        run: uv run python scripts/export_openapi.py
-
-      - name: Check OpenAPI spec is up to date
-        working-directory: frontend
-        run: |
-          git --no-pager diff --exit-code -- src/api/openapi.json || (echo 'OpenAPI spec is out of date. Run: cd backend && uv run python scripts/export_openapi.py' && exit 1)
-
-      
-      - name: Check frontend OpenAPI types
-        working-directory: frontend
-        run: npm run api:types:check
-
-      - name: Build frontend
-        working-directory: frontend
-        run: npm run test:run
-
-      - name: Publish frontend test results
-        uses: EnricoMi/publish-unit-test-result-action@v2
-        if: always()
-        with:
-          files: frontend/junit-results.xml
-          check_name: "Frontend Test Results"
-          comment_title: "GTC Frontend Test Results"
-
   build-and-push:
     environment: dev
     name: Build & push backend image
diff --git a/.github/workflows/harness.yml b/.github/workflows/harness.yml
new file mode 100644
index 0000000..ed302b6
--- /dev/null
+++ b/.github/workflows/harness.yml
@@ -0,0 +1,62 @@
+name: Harness CI
+
+on:
+  push:
+    branches: [main]
+    paths:
+      - 'AGENTS.md'
+      - 'Makefile.harness'
+      - 'docs/ARCHITECTURE.md'
+      - 'docs/OBSERVABILITY.md'
+      - 'scripts/audit_harness.sh'
+      - 'scripts/verify_customized.sh'
+      - 'scripts/harness/**'
+      - '.github/workflows/harness.yml'
+      - 'backend/**'
+      - 'frontend/**'
+  pull_request:
+    paths:
+      - 'AGENTS.md'
+      - 'Makefile.harness'
+      - 'docs/ARCHITECTURE.md'
+      - 'docs/OBSERVABILITY.md'
+      - 'scripts/audit_harness.sh'
+      - 'scripts/verify_customized.sh'
+      - 'scripts/harness/**'
+      - '.github/workflows/harness.yml'
+      - 'backend/**'
+      - 'frontend/**'
+
+permissions:
+  contents: read
+
+jobs:
+  harness:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v4
+
+      - name: Setup Python and uv
+        uses: astral-sh/setup-uv@v7
+        with:
+          python-version: '3.11'
+
+      - name: Setup Node.js
+        uses: actions/setup-node@v4
+        with:
+          node-version: '20'
+          cache: 'npm'
+          cache-dependency-path: frontend/package-lock.json
+
+      - name: Install dependencies through harness
+        run: make -f Makefile.harness setup
+
+      - name: Audit harness artifacts
+        run: ./scripts/audit_harness.sh .
+
+      - name: Verify customization
+        run: ./scripts/verify_customized.sh .
+
+      - name: Run harness pipeline
+        run: make -f Makefile.harness ci
diff --git a/.gitignore b/.gitignore
index 30dcbb5..5e5bc30 100644
--- a/.gitignore
+++ b/.gitignore
@@ -45,6 +45,9 @@ build/
 # Based on common Node.gitignore defaults
 # ------------------------------
 .vitest/
+.playwright/
+frontend/playwright-report/
+frontend/test-results/
 .eslintcache
 *.tsbuildinfo
 
@@ -53,15 +56,22 @@ build/
 # ------------------------------
 .vscode/*
 !.vscode/extensions.json
+!.vscode/settings.json
 
 **/.vscode/*
 !**/.vscode/tasks.json
 !**/.vscode/launch.json
 !**/.vscode/extensions.json
+!**/.vscode/settings.json
 
 .idea/
 *.iml
 site/
 
 .playwright-cli/
-.copilot-tracking/
\ No newline at end of file
+.copilot-tracking/.harness/
+.harness/
+.copilot-tracking/*
+
+# Redacted trace exports (may contain PII)
+wireframes/trace_export_redacted.json
diff --git a/.gitlab-ci.yml b/.gitlab-ci.yml
new file mode 100644
index 0000000..b9e6664
--- /dev/null
+++ b/.gitlab-ci.yml
@@ -0,0 +1,16 @@
+workflow:
+  rules:
+    - if: '$DEPLOY_BACKEND == "true" && $CI_COMMIT_REF_PROTECTED == "true" && ($CI_PIPELINE_SOURCE == "web" || $CI_PIPELINE_SOURCE == "api" || $CI_PIPELINE_SOURCE == "trigger" || $CI_PIPELINE_SOURCE == "pipeline")'
+    - if: '$DEPLOY_BACKEND == "true"'
+      when: never
+    - when: always
+
+include:
+  - local: '.gitlab/ci/ci.yml'
+    rules:
+      - if: '$DEPLOY_BACKEND == "true" && $CI_COMMIT_REF_PROTECTED == "true" && ($CI_PIPELINE_SOURCE == "web" || $CI_PIPELINE_SOURCE == "api" || $CI_PIPELINE_SOURCE == "trigger" || $CI_PIPELINE_SOURCE == "pipeline")'
+        when: never
+      - when: always
+  - local: '.gitlab/ci/deploy.yml'
+    rules:
+      - if: '$DEPLOY_BACKEND == "true" && $CI_COMMIT_REF_PROTECTED == "true" && ($CI_PIPELINE_SOURCE == "web" || $CI_PIPELINE_SOURCE == "api" || $CI_PIPELINE_SOURCE == "trigger" || $CI_PIPELINE_SOURCE == "pipeline")'
diff --git a/.gitlab/ci/ci.yml b/.gitlab/ci/ci.yml
new file mode 100644
index 0000000..4937371
--- /dev/null
+++ b/.gitlab/ci/ci.yml
@@ -0,0 +1,86 @@
+variables:
+  UV_CACHE_DIR: "$CI_PROJECT_DIR/.cache/uv"
+  PIP_CACHE_DIR: "$CI_PROJECT_DIR/.cache/pip"
+  NPM_CONFIG_CACHE: "$CI_PROJECT_DIR/.cache/npm"
+  GTC_LLM_ENABLED: "False"
+
+stages:
+  - test
+  - integration
+  - docs
+
+default:
+  image: node:20-bookworm
+  cache:
+    key:
+      files:
+        - backend/uv.lock
+        - frontend/package-lock.json
+    paths:
+      - .cache/pip/
+      - .cache/uv/
+      - .cache/npm/
+  before_script:
+    - apt-get update
+    - apt-get install -y --no-install-recommends python3 python3-pip make curl jq
+    - rm -rf /var/lib/apt/lists/*
+    - python3 -m pip install --break-system-packages uv
+    - make -f Makefile.harness setup
+
+app-ci:
+  stage: test
+  script:
+    - make -f Makefile.harness ci
+  artifacts:
+    when: always
+    reports:
+      junit:
+        - backend/pytest-unit-results.xml
+        - frontend/junit-results.xml
+    paths:
+      - .harness/logs.jsonl
+      - .harness/traces.jsonl
+
+backend-integration:
+  stage: integration
+  services:
+    - name: mcr.microsoft.com/cosmosdb/linux/azure-cosmos-emulator:vnext-EN20251022
+      alias: cosmos
+      variables:
+        AZURE_COSMOS_EMULATOR_ENABLE_HTTP_API: "true"
+        AZURE_COSMOS_EMULATOR_PARTITION_COUNT: "3"
+  variables:
+    GTC_COSMOS_ENDPOINT: "http://cosmos:8081"
+    HARNESS_COSMOS_READY_URL: "http://cosmos:8081"
+  script:
+    - make -f Makefile.harness backend-integration-test
+  artifacts:
+    when: always
+    reports:
+      junit: backend/pytest-int-results.xml
+
+docs-pages:
+  stage: docs
+  image: python:3.11-bookworm
+  pages: true
+  before_script: []
+  script:
+    - python -m pip install uv
+    - cd backend && uv sync --frozen
+    - uv run mkdocs build -f ../mkdocs.yml --site-dir ../public
+  artifacts:
+    paths:
+      - public/
+  environment:
+    name: docs
+    url: $CI_PAGES_URL
+  rules:
+    - if: '$CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH'
+      changes:
+        - docs/**/*
+        - mkdocs.yml
+        - backend/pyproject.toml
+        - backend/uv.lock
+        - .gitlab-ci.yml
+        - .gitlab/ci/ci.yml
+    - when: never
diff --git a/.gitlab/ci/deploy.yml b/.gitlab/ci/deploy.yml
new file mode 100644
index 0000000..af472ea
--- /dev/null
+++ b/.gitlab/ci/deploy.yml
@@ -0,0 +1,80 @@
+stages:
+  - validate
+  - deploy
+
+validate-deploy-inputs:
+  stage: validate
+  inherit:
+    default: false
+  image: alpine:3.21
+  needs: []
+  rules:
+    - if: '$DEPLOY_BACKEND == "true" && $CI_COMMIT_REF_PROTECTED == "true"'
+    - when: never
+  script:
+    - |
+      set -eu
+
+      if [ "${DEPLOY_BACKEND:-}" != "true" ]; then
+        echo "DEPLOY_BACKEND must be set to true for deploy pipelines." >&2
+        exit 1
+      fi
+
+      if [ -z "${DEPLOYMENT_TARGET_NAME:-}" ]; then
+        echo "DEPLOYMENT_TARGET_NAME is required when DEPLOY_BACKEND=true." >&2
+        exit 1
+      fi
+
+      case "$DEPLOYMENT_TARGET_NAME" in
+        *[!A-Za-z0-9._-]*)
+          echo "DEPLOYMENT_TARGET_NAME may only contain letters, numbers, dot, underscore, or hyphen." >&2
+          exit 1
+          ;;
+      esac
+
+      if [ "${CI_COMMIT_REF_PROTECTED:-false}" != "true" ]; then
+        echo "Deploy pipelines must run on a protected ref." >&2
+        exit 1
+      fi
+
+      if [ -n "${TAG_NAME:-}" ] && [ -n "${CONTAINER_IMAGE:-}" ]; then
+        echo "Set either TAG_NAME or CONTAINER_IMAGE, not both." >&2
+        exit 1
+      fi
+
+      if [ -z "${TAG_NAME:-}${CONTAINER_IMAGE:-}" ]; then
+        echo "Set TAG_NAME or CONTAINER_IMAGE when DEPLOY_BACKEND=true." >&2
+        exit 1
+      fi
+
+      if [ -n "${TAG_NAME:-}" ] && [ -z "${REGISTRY_PREFIX:-}" ]; then
+        echo "REGISTRY_PREFIX is required when TAG_NAME is used." >&2
+        exit 1
+      fi
+
+      if [ -n "${CONTAINER_IMAGE:-}" ] && ! printf '%s' "$CONTAINER_IMAGE" | grep -q '/'; then
+        echo "CONTAINER_IMAGE must include a registry hostname and image path." >&2
+        exit 1
+      fi
+
+deploy-backend:
+  stage: deploy
+  inherit:
+    default: false
+  image: mcr.microsoft.com/azure-cli:2.64.0
+  allow_failure: false
+  needs:
+    - validate-deploy-inputs
+  rules:
+    - if: '$DEPLOY_BACKEND == "true" && $CI_COMMIT_REF_PROTECTED == "true"'
+      when: manual
+    - when: never
+  before_script:
+    - tdnf install -y make
+    - az config set core.only_show_errors=yes
+    - az login --service-principal --username "$AZURE_CLIENT_ID" --password "$AZURE_CLIENT_SECRET" --tenant "$AZURE_TENANT_ID"
+    - az account set --subscription "$AZURE_SUBSCRIPTION_ID"
+  script:
+    - make -f Makefile.harness deploy
+  environment:
+    name: $DEPLOYMENT_TARGET_NAME
diff --git a/AGENTS.md b/AGENTS.md
index 584235d..691232f 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -1,109 +1,157 @@
-# Agent Instructions
+# AGENTS.md
 
-## Testing and Build Commands
+Ground Truth Curator is a monorepo for curating high-quality ground truth datasets for agent evaluation and model accuracy measurement. The backend is FastAPI/Python, the frontend is React/TypeScript, and the production data plane is centered on Azure services such as Cosmos DB, Blob Storage, Search, and Azure AI.
 
-### Backend (Python with uv)
+## Project Overview
 
-```bash
-cd backend
+- Project: `Ground Truth Curator`
+- Primary runtimes: Python 3.11 (`backend/`) and Node.js 20 (`frontend/`)
+- Main entrypoints: backend FastAPI app at `backend/app/main.py`, frontend Vite app at `frontend/src/main.tsx`
+
+## Harness Commands
+
+Run from repository root:
 
-# Run all unit tests
-uv run pytest tests/unit/ -v
+| Goal | Command |
+|---|---|
+| Install dependencies | `make -f Makefile.harness setup` |
+| Start backend dev server | `make -f Makefile.harness backend` |
+| Start frontend dev server | `make -f Makefile.harness frontend` |
+| Start both dev servers (foreground) | `make -f Makefile.harness dev` |
+| Start both dev servers (background) | `make -f Makefile.harness dev-up` |
+| Stop background dev servers | `make -f Makefile.harness dev-down` |
+| Auto-format code | `make -f Makefile.harness format` |
+| Fast sanity check | `make -f Makefile.harness smoke` |
+| API contract and generated types | `make -f Makefile.harness api-check` |
+| Static checks | `make -f Makefile.harness check` |
+| Full test suite | `make -f Makefile.harness test` |
+| Backend integration tests | `make -f Makefile.harness backend-integration-test` |
+| CI-equivalent local run | `make -f Makefile.harness ci` |
+| Backend deploy | `make -f Makefile.harness deploy` |
+| CI + telemetry review | `make -f Makefile.harness verify` |
 
-# Run specific test file
-uv run pytest tests/unit/test_dos_prevention.py -v
+Use `dev` for interactive local work, and `dev-up` / `dev-down` when an agent or developer needs the servers managed in the background. Background PID files and logs live under `.harness/dev/`.
+Run GitLab deploy pipelines with `DEPLOY_BACKEND=true`, `DEPLOYMENT_TARGET_NAME=<environment>`, and either `TAG_NAME=<image-tag>` or `CONTAINER_IMAGE=<full-image>`, plus environment-scoped Azure and `GTC_*` deploy variables.
 
-# Run tests matching keyword
-uv run pytest tests/unit/ -k "bulk" -v
+### Demo Mode
 
-# Type checking (uses 'ty' not mypy)
-uv run ty check app/  # Check entire app directory
-uv run ty check app/api/v1/ground_truths.py  # Check specific file
+To start dev servers with in-memory demo data (no Cosmos dependency):
+
+```bash
+VITE_DEMO_MODE=true VITE_DEV_USER_ID=demo-user make -f Makefile.harness dev-up
 ```
 
-### Frontend (Node.js)
+If you prefer a single shortcut target with the same stable demo identity, use:
 
 ```bash
-cd frontend
+make -f Makefile.harness dev-up-demo
+```
 
-# Run unit tests once (preferred for automation/agents)
-# Note: Vitest 3.2.4 doesn't support `--no-threads` at runtime; use the threads pool in single-thread mode to avoid spawning many Node processes.
-npm run test:run -- --pool=threads --poolOptions.threads.singleThread
+The `sample.env` sets `GTC_REPO_BACKEND=cosmos` by default. To force the in-memory backend (e.g. when no Cosmos emulator is running), add `GTC_REPO_BACKEND=memory`:
 
-# Pre-commit validation (lint + typecheck, no auto-fix)
-npm run pre-commit
+```bash
+GTC_REPO_BACKEND=memory VITE_DEMO_MODE=true VITE_DEV_USER_ID=demo-user make -f Makefile.harness dev-up
+```
 
-# Build
-npm run build
+## Backend Commands
 
-# Type checking (note: 'typecheck' not 'type-check')
-npm run typecheck
+Run from `backend/`:
 
-# Linting (auto-fix)
-npm run lint
+| Goal | Command |
+|---|---|
+| Install deps | `uv sync` |
+| Dev server | `uv run uvicorn app.main:app --reload` |
+| Test (unit) | `uv run pytest tests/unit/ -v` |
+| Test (integration) | `uv run pytest tests/integration/ -v` |
+| Test (single) | `uv run pytest tests/unit/test_dos_prevention.py -v` |
+| Test (keyword) | `uv run pytest tests/unit/ -k "bulk" -v` |
+| Type check | `uv run ty check app/` |
+| Lint | `uv run ruff check app/` |
+| Docs build | `uv run mkdocs build -f ../mkdocs.yml` |
 
-# Linting check only (no auto-fix)
-npm run lint:check
-```
+## Frontend Commands
 
-## Documentation
+Run from `frontend/`:
 
-```bash
-# Build documentation site
-cd backend
-uv run mkdocs build -f ../mkdocs.yml
-
-# Serve documentation locally
-cd backend
-uv run mkdocs serve -f ../mkdocs.yml
-# Then open http://localhost:8000
-```
+| Goal | Command |
+|---|---|
+| Install deps | `npm install` |
+| Dev server | `npm run dev` |
+| Test (all) | `npm run test:run -- --pool=threads --poolOptions.threads.singleThread` |
+| Build | `npm run build` |
+| Type check | `npm run typecheck` |
+| Lint (check) | `npm run lint:check` |
+| Lint (fix) | `npm run lint` |
+| Pre-commit | `npm run pre-commit` |
 
-## Cosmos DB Operations
+## Frontend React Guidance
 
-### Indexing Policy Updates
+- For React and TypeScript work in `frontend/`, consult `.github/skills/react-best-practices/SKILL.md` and `.github/skills/react-best-practices/APPLICABILITY.md`.
+- This frontend is a Vite + React app, not Next.js.
+- Apply framework-agnostic React, re-render, rendering, bundle, and JavaScript performance rules normally.
+- Translate `next/dynamic` guidance to `React.lazy()` or gated `import()` patterns instead of copying Next.js examples directly.
+- Treat Next.js-only rules (`async-api-routes`, `server-*`, API routes, server actions, `after()`, and other server-only guidance) as not applicable unless the stack changes.
+- Treat SWR-specific guidance as reference-only; current frontend data access uses `fetch` / `openapi-fetch` helpers under `frontend/src/api/` and `frontend/src/services/`.
 
-To update the Cosmos DB indexing policy:
+## Debugging Loop
 
-```bash
-cd backend/scripts
+When `make -f Makefile.harness ci` fails, identify the stage before changing code blindly.
 
-# For local emulator
-python cosmos_container_manager.py update-gt \
-  --endpoint https://localhost:8081 \
-  --indexing-policy indexing-policy-optimized.json
+1. `smoke` failed: the backend did not start cleanly, the health probes failed, or the harness telemetry files were not produced.
+2. `check` failed: run the backend/frontend lint or typecheck command directly to isolate the failing side.
+3. `test` failed: run the backend and frontend test commands directly and inspect the failing suite.
+4. `verify` shows warnings/errors: inspect `.harness/logs.jsonl` and `.harness/traces.jsonl` with `jq` before rerunning.
 
-# For production (requires connection string)
-python cosmos_container_manager.py update-gt \
-  --connection-string "$COSMOS_CONNECTION_STRING" \
-  --indexing-policy indexing-policy-optimized.json
-```
+## Constraints And Guardrails
 
-**Note**: Reindexing takes 1-6 hours depending on data size. Monitor progress in Azure Portal or via SDK.
+- Preserve the backend layering: `api/v1 -> services -> adapters`.
+- Do not import adapters directly from FastAPI route modules.
+- Domain models belong in `backend/app/domain/`, not in route handlers or React components.
+- Frontend network calls belong in `frontend/src/api/` or `frontend/src/services/`, not in presentational components.
+- Regenerate frontend API types when backend API schemas change: `cd frontend && npm run api:types`.
+- Do not modify `infra/` without explicit user direction.
+- Treat `scripts/harness/` and `Makefile.harness` as operational code: change them only when the repo workflow actually changes.
+- Backend deploys use `scripts/harness/deploy_backend.sh` (via `make -f Makefile.harness deploy` locally) and expect Azure/auth/runtime values from CI/CD variables rather than repo-side deployment env files.
 
-See `docs/operations/COSMOS-OPTIMIZATION-README.md` for detailed deployment guide.
+## Architecture Boundaries
 
-### Query Performance Monitoring
+- `backend/app/api/v1/` owns HTTP parsing, status codes, and response shapes.
+- `backend/app/services/` owns orchestration and business workflows.
+- `backend/app/adapters/` owns external I/O: Cosmos DB, Search, Blob Storage, inference, and similar integrations.
+- `backend/app/domain/` owns typed request/response/data models shared across backend layers.
+- `backend/app/plugins/` owns computed-tag extensions and registry-driven enrichment.
+- `frontend/src/api/` owns typed backend communication, while `frontend/src/components/` owns rendering and interaction.
 
-To enable RU cost tracking and query performance monitoring:
+See `docs/ARCHITECTURE.md` before changing cross-layer behavior.
 
-```bash
-# Enable all query metrics logging
-export GTC_COSMOS_LOG_QUERY_METRICS=true
-export GTC_COSMOS_LOG_SLOW_QUERIES_ONLY=false
-
-# Enable only slow query logging (RU >= 10.0)
-export GTC_COSMOS_LOG_QUERY_METRICS=true
-export GTC_COSMOS_LOG_SLOW_QUERIES_ONLY=true
-export GTC_COSMOS_SLOW_QUERY_RU_THRESHOLD=10.0
-```
+## Observability Convention
+
+- Local harness runs emit JSONL request logs to `.harness/logs.jsonl` and request traces to `.harness/traces.jsonl`.
+- The backend keeps Azure Monitor / OpenTelemetry support for deployed environments; harness JSONL is the local agent-facing mirror.
+- Request log-level policy is `INFO` for 2xx, `WARN` for 4xx, and `ERROR` for 5xx or unhandled exceptions.
+- `make -f Makefile.harness verify` reads the last runtime errors and slow traces with `jq`.
+
+See `docs/OBSERVABILITY.md` for field names, examples, and query patterns.
+
+## Execution Plans
+
+- Use `PLANS.md` for multi-step work that spans investigation, implementation, and verification.
+- Capture the objective, non-goals, relevant files, risks, and the exact commands that prove the work is done.
+- Refresh the plan when scope changes so a restarted agent can pick up quickly.
+
+## Static Analysis And Quality Gates
 
-Metrics are logged with structured fields:
+- Run `make -f Makefile.harness format` before committing code changes.
+- Run `make -f Makefile.harness check` before `make -f Makefile.harness test`.
+- Backend quality gate: `uv run ruff check app/` and `uv run ty check app/`.
+- Frontend quality gate: `npm run lint:check` and `npm run typecheck`.
+- Test gate: backend unit tests plus frontend Vitest suite must pass.
+- Smoke gate: backend must boot locally, respond on `/healthz`, answer `/v1/openapi.json`, and emit `.harness` telemetry.
 
-- `operation`: Operation name (e.g., "stats.count_all_items", "list_gt_paginated.direct_query")
-- `ru_charge`: Request Units consumed
-- `item_count`: Number of items returned
-- `elapsed_ms`: Query execution time in milliseconds
-- `query`: First 200 characters of SQL query
+## Known Gotchas
 
-**Note**: Disabled by default to minimize log volume. Enable in staging/production for profiling.
+- Backend type checking uses `ty`, not mypy.
+- Frontend linting uses Biome, not ESLint.
+- Frontend unit tests should use `--pool=threads --poolOptions.threads.singleThread` in agent automation.
+- The backend defaults to `REPO_BACKEND=memory`, which keeps local smoke checks self-contained.
+- `.harness/` is intentionally ephemeral and should never be committed.
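Review note on the observability convention above: the `jq` filters used by `verify` can also be expressed in Python when an agent wants structured results rather than terminal output. The following is a minimal sketch, not part of the repository. It assumes only the `level` and `duration_ms` fields that the `Makefile.harness` filters rely on; the function names and the default thresholds are hypothetical.

```python
import json
from pathlib import Path
from typing import Any, Iterator


def read_jsonl(path: Path) -> Iterator[dict[str, Any]]:
    """Yield one dict per non-empty JSONL line; a missing file yields nothing."""
    if not path.exists():
        return
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            if isinstance(record, dict):
                yield record


def recent_errors(log_path: Path, last: int = 5) -> list[dict[str, Any]]:
    """Return the last `last` records whose level is ERROR (hypothetical helper)."""
    errors = [r for r in read_jsonl(log_path) if r.get("level") == "ERROR"]
    return errors[-last:]


def slow_traces(
    trace_path: Path, threshold_ms: float = 1000.0, last: int = 5
) -> list[dict[str, Any]]:
    """Return the last `last` traces slower than `threshold_ms` (hypothetical helper)."""
    slow = [
        r
        for r in read_jsonl(trace_path)
        if isinstance(r.get("duration_ms"), (int, float))
        and r["duration_ms"] > threshold_ms
    ]
    return slow[-last:]
```

Called as `recent_errors(Path(".harness/logs.jsonl"))`, this returns whole records, so the caller never sees a partially printed JSON object.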
diff --git a/Makefile b/Makefile
new file mode 100644
index 0000000..68f01c7
--- /dev/null
+++ b/Makefile
@@ -0,0 +1 @@
+-include Makefile.harness
\ No newline at end of file
diff --git a/Makefile.harness b/Makefile.harness
new file mode 100644
index 0000000..9f592e1
--- /dev/null
+++ b/Makefile.harness
@@ -0,0 +1,85 @@
+.PHONY: setup backend frontend dev dev-up dev-up-demo dev-down format smoke api-check test backend-integration-test lint typecheck check ci deploy verify observe
+
+setup:
+	@./scripts/harness/setup.sh
+
+backend:
+	@cd backend && uv run uvicorn app.main:app --reload
+
+frontend:
+	@cd frontend && npm run dev
+
+dev:
+	@backend_pid=''; frontend_pid=''; \
+	trap 'status=$$?; if [ -n "$$backend_pid" ]; then kill $$backend_pid 2>/dev/null || true; fi; if [ -n "$$frontend_pid" ]; then kill $$frontend_pid 2>/dev/null || true; fi; exit $$status' INT TERM EXIT; \
+	( cd backend && uv run uvicorn app.main:app --reload ) & backend_pid=$$!; \
+	( cd frontend && npm run dev ) & frontend_pid=$$!; \
+	while kill -0 $$backend_pid 2>/dev/null && kill -0 $$frontend_pid 2>/dev/null; do \
+		sleep 1; \
+	done; \
+	status=0; \
+	if ! kill -0 $$backend_pid 2>/dev/null; then \
+		wait $$backend_pid || status=$$?; \
+	fi; \
+	if ! kill -0 $$frontend_pid 2>/dev/null; then \
+		wait $$frontend_pid || status=$$?; \
+	fi; \
+	trap - INT TERM EXIT; \
+	kill $$backend_pid $$frontend_pid 2>/dev/null || true; \
+	wait $$backend_pid $$frontend_pid 2>/dev/null || true; \
+	exit $$status
+
+dev-up:
+	@./scripts/harness/dev_up.sh
+
+dev-up-demo:
+	@GTC_REPO_BACKEND=memory GTC_DEMO_MODE=true GTC_DEMO_USER_ID=demo-user VITE_DEMO_MODE=true VITE_DEV_USER_ID=demo-user ./scripts/harness/dev_up.sh
+
+dev-down:
+	@./scripts/harness/dev_down.sh
+
+format:
+	@./scripts/harness/format.sh
+
+smoke:
+	@./scripts/harness/smoke.sh
+
+api-check:
+	@./scripts/harness/api_check.sh
+
+test:
+	@./scripts/harness/test.sh
+
+backend-integration-test:
+	@./scripts/harness/backend_integration_test.sh
+
+lint:
+	@./scripts/harness/lint.sh
+
+typecheck:
+	@./scripts/harness/typecheck.sh
+
+check: lint typecheck
+
+ci: smoke check api-check test
+
+deploy:
+	@./scripts/harness/deploy_backend.sh
+
+# Agent verify loop: run checks + inspect runtime telemetry
+verify: ci
+	@echo "--- Runtime Errors ---"
+	@jq -c 'select(.level == "ERROR")' .harness/logs.jsonl 2>/dev/null | tail -5 || true
+	@echo "--- Slow Requests (>1s) ---"
+	@jq -c 'select(.duration_ms > 1000)' .harness/traces.jsonl 2>/dev/null | tail -5 || true
+
+# Quick observability check (no test run)
+observe:
+	@echo "=== Errors ==="
+	@jq -s 'map(select(.level == "ERROR")) | length' .harness/logs.jsonl 2>/dev/null || echo "0"
+	@echo "=== Slow (>500ms) ==="
+	@jq -s 'map(select(.duration_ms > 500)) | length' .harness/traces.jsonl 2>/dev/null || echo "0"
+	@echo "=== Log Lines ==="
+	@if [ -f .harness/logs.jsonl ]; then wc -l < .harness/logs.jsonl | awk '{print $$1}'; else echo "0"; fi
+	@echo "=== Trace Lines ==="
+	@if [ -f .harness/traces.jsonl ]; then wc -l < .harness/traces.jsonl | awk '{print $$1}'; else echo "0"; fi
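Review note on the `ci` target above: it lists its prerequisites as `smoke check api-check test`, and the debugging guidance asks for the failing stage to be identified first. One way an agent could do that is to run the stages one at a time and stop at the first non-zero exit. This is a rough sketch under that assumption. The helper names are hypothetical, and the runner is injectable so the logic can be exercised without invoking `make`.

```python
import subprocess
from typing import Callable, Optional, Sequence

# Stage order copied from the `ci` prerequisites in Makefile.harness.
CI_STAGES: tuple[str, ...] = ("smoke", "check", "api-check", "test")


def run_make_target(target: str) -> int:
    """Run one harness target and return its exit code."""
    completed = subprocess.run(
        ["make", "-f", "Makefile.harness", target], check=False
    )
    return completed.returncode


def first_failing_stage(
    stages: Sequence[str] = CI_STAGES,
    runner: Callable[[str], int] = run_make_target,
) -> Optional[str]:
    """Run stages in order; return the first that fails, or None if all pass."""
    for stage in stages:
        if runner(stage) != 0:
            return stage
    return None
```

Stopping at the first failure keeps the output focused on one stage, which matches the intent of the numbered debugging steps.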
diff --git a/README.md b/README.md
index ee431fc..1b73096 100644
--- a/README.md
+++ b/README.md
@@ -22,20 +22,38 @@ npm install
 
 ```bash
 # Start backend
-cd backend
-uv run uvicorn app.main:app --reload
+make -f Makefile.harness backend
 
 # Start frontend (in another terminal)
-cd frontend
-npm run dev
+make -f Makefile.harness frontend
+
+# Or run both from one terminal
+make -f Makefile.harness dev
+
+# Or run both in the background
+make -f Makefile.harness dev-up
+make -f Makefile.harness dev-down
 ```
 
+These targets wrap the existing local dev commands in `backend/` and `frontend/`. Use `dev` for a foreground session, or `dev-up` / `dev-down` when an agent or developer wants background-managed servers with logs in `.harness/dev/`.
+
+To start the background demo stack with seeded demo data and a stable local identity, run:
+
+```bash
+VITE_DEMO_MODE=true VITE_DEV_USER_ID=demo-user make dev-up
+```
+
+This enables the demo UI flow, seeds the backend memory repo with demo items, and uses `demo-user` for assignment-aware API calls. Stop it later with `make dev-down`.
+
+If you prefer a single shortcut target, `make -f Makefile.harness dev-up-demo` applies the same memory-backed demo settings and `demo-user` identity automatically.
+
 ### Running Tests
 
 ```bash
 # Backend tests
 cd backend
 uv run pytest tests/unit/ -v
+uv run pytest tests/integration/ -v
 
 # Frontend tests
 cd frontend
diff --git a/backend/app/adapters/agent_steps_store.py b/backend/app/adapters/agent_steps_store.py
deleted file mode 100644
index 4dd3e2b..0000000
--- a/backend/app/adapters/agent_steps_store.py
+++ /dev/null
@@ -1,16 +0,0 @@
-from __future__ import annotations
-
-from typing import Any
-
-
-class AgentStepsStore:
-    """Placeholder interface for persisting agent step details."""
-
-    async def save(
-        self,
-        *,
-        user_id: str,
-        request: dict[str, Any],
-        response: dict[str, Any],
-    ) -> None:
-        raise NotImplementedError
diff --git a/backend/app/adapters/gtc_inference_adapter.py b/backend/app/adapters/gtc_inference_adapter.py
deleted file mode 100644
index 68605d2..0000000
--- a/backend/app/adapters/gtc_inference_adapter.py
+++ /dev/null
@@ -1,213 +0,0 @@
-"""
-GTC Inference Adapter - wraps test-client's InferenceService for GTC's simplified interface.
-
-This adapter provides a shim layer between:
-- GTC's ChatService which expects: generate(user_id, message) -> {content, references}
-- test-client's InferenceService which expects: process_inference_request(history, bus) -> {response_text, calls, ...}
-
-The underlying inference.py from test-client remains UNCHANGED.
-"""
-
-from __future__ import annotations
-
-import logging
-from typing import Any
-
-from azure.identity import DefaultAzureCredential
-
-from app.adapters.inference import (
-    EventBus,
-    InferenceService,
-    ConversationTurn,
-)
-
-logger = logging.getLogger(__name__)
-
-# Limits for reference extraction (matching original inference_service.py)
-MAX_RESULTS = 100
-MAX_STRING_LENGTH = 1000  # For content/title fields
-
-
-class GTCInferenceAdapter:
-    """
-    Adapter that wraps test-client's InferenceService for GTC's simplified interface.
-
-    This adapter:
-    - Converts GTC's (user_id, message) input to ConversationTurn history
-    - Creates a no-op EventBus (with optional logging handlers)
-    - Calls the underlying InferenceService.process_inference_request()
-    - Converts the response {response_text, calls} to GTC's {content, references}
-    """
-
-    def __init__(
-        self,
-        *,
-        project_endpoint: str,
-        agent_id: str,
-        retrieval_url: str,
-        permissions_scope: str,
-        timeout_seconds: int = 30,
-        credential: DefaultAzureCredential | None = None,
-    ) -> None:
-        """
-        Initialize the GTC Inference Adapter.
-
-        Args:
-            project_endpoint: Azure AI Project endpoint
-            agent_id: Azure AI Agent ID
-            retrieval_url: URL for the retrieval service
-            permissions_scope: OAuth scope for retrieval service authentication
-            timeout_seconds: Timeout for HTTP requests (passed to underlying service)
-            credential: Optional Azure credential (creates DefaultAzureCredential if None)
-        """
-        self._project_endpoint = project_endpoint
-        self._agent_id = agent_id
-        self._retrieval_url = retrieval_url
-        self._permissions_scope = permissions_scope
-        self._timeout_seconds = timeout_seconds
-
-        # Use provided credential or create new one
-        self._credential = credential or DefaultAzureCredential(
-            exclude_shared_token_cache_credential=True
-        )
-
-        # Create the underlying InferenceService from test-client
-        # Note: The test-client's InferenceService doesn't accept timeout_seconds,
-        # but the retrieval tool inside uses a hardcoded 30s timeout
-        self._inference_service = InferenceService(
-            project_endpoint=self._project_endpoint,
-            agent_id=self._agent_id,
-            retrieval_url=self._retrieval_url,
-            permissions_scope=self._permissions_scope,
-            client=None,  # Use real client
-            logger_override=None,  # Use default logger
-        )
-
-        logger.info(
-            "GTCInferenceAdapter initialized (endpoint=%s, agent=%s, retrieval_host=%s)",
-            self._project_endpoint,
-            self._agent_id,
-            self._retrieval_url.split("/")[2] if self._retrieval_url else "none",
-        )
-
-    def close(self) -> None:
-        """Close the underlying inference service."""
-        if hasattr(self._inference_service, "_safe_close_client"):
-            self._inference_service._safe_close_client()
-
-    def _create_event_bus(self) -> EventBus:
-        """Create an EventBus for the inference call."""
-        return EventBus()
-
-    def _extract_references(self, calls: list[dict[str, Any]]) -> list[dict[str, Any]]:
-        """
-        Extract references from tool call results.
-
-        Converts the 'calls' structure from InferenceService to GTC's 'references' format.
-        Each call contains 'results' which are the retrieved documents.
-
-        Args:
-            calls: List of tool calls from InferenceService
-
-        Returns:
-            List of references in GTC format: {id, title, url, snippet}
-        """
-        references: list[dict[str, Any]] = []
-
-        for call in calls:
-            if not isinstance(call, dict):
-                continue
-
-            # Skip calls with errors
-            if call.get("error"):
-                logger.warning("Skipping call with error: %s", call.get("error"))
-                continue
-
-            # Extract results from the call
-            results = call.get("results", [])
-            for doc in results:
-                if not isinstance(doc, dict):
-                    continue
-
-                ref = {
-                    "id": doc.get("chunk_id") or doc.get("id"),
-                    "title": doc.get("title"),
-                    "url": doc.get("url"),
-                    "snippet": doc.get("content"),  # Map content -> snippet for ChatReference
-                }
-
-                # Truncate snippet to prevent excessive data
-                if ref["snippet"] and len(ref["snippet"]) > MAX_STRING_LENGTH:
-                    ref["snippet"] = ref["snippet"][:MAX_STRING_LENGTH] + "..."
-
-                references.append(ref)
-
-        # Cap total references
-        if len(references) > MAX_RESULTS:
-            logger.warning(
-                "Truncating %d references to %d to prevent resource exhaustion",
-                len(references),
-                MAX_RESULTS,
-            )
-            references = references[:MAX_RESULTS]
-
-        return references
-
-    def generate(self, *, user_id: str, message: str) -> dict[str, Any]:
-        """
-        Generate a response using the Azure AI Foundry Agent.
-
-        This method provides the interface expected by GTC's ChatService.
-
-        Args:
-            user_id: User identifier for logging
-            message: User's question/message
-
-        Returns:
-            Dict with 'content' (assistant reply) and 'references' (list of citations)
-
-        Raises:
-            RuntimeError: If agent processing fails
-        """
-        if not message.strip():
-            raise ValueError("message cannot be empty")
-
-        # Create EventBus for this request
-        bus = self._create_event_bus()
-
-        # Convert message to ConversationTurn history
-        history = [ConversationTurn(role="user", msg=message)]
-
-        try:
-            # Call the underlying InferenceService
-            logger.debug("Calling InferenceService for user=%s", user_id)
-            result = self._inference_service.process_inference_request(
-                history=history,
-                bus=bus,
-                disable_retry=False,
-                max_retries=3,
-            )
-
-            # Extract response text
-            response_text = result.get("response_text", "")
-            if not response_text or not response_text.strip():
-                raise RuntimeError("Agent returned empty response")
-
-            # Extract references from calls
-            calls = result.get("calls", [])
-            references = self._extract_references(calls)
-
-            logger.info(
-                "Agent response generated for user=%s, refs=%d, content_len=%d",
-                user_id,
-                len(references),
-                len(response_text),
-            )
-
-            return {"content": response_text, "references": references}
-
-        except Exception as exc:
-            # Convert various exceptions to RuntimeError for consistent interface
-            error_msg = str(exc)
-            logger.error("Agent request failed for user=%s: %s", user_id, error_msg)
-            raise RuntimeError(f"Agent request failed: {error_msg}") from exc
diff --git a/backend/app/adapters/repos/base.py b/backend/app/adapters/repos/base.py
index 4092f8a..7d4739b 100644
--- a/backend/app/adapters/repos/base.py
+++ b/backend/app/adapters/repos/base.py
@@ -4,7 +4,7 @@
 from uuid import UUID
 
 from app.domain.models import (
-    GroundTruthItem,
+    AgenticGroundTruthEntry,
     Stats,
     DatasetCurationInstructions,
     AssignmentDocument,
@@ -17,14 +17,14 @@
 class GroundTruthRepo(Protocol):
     # Core ground truth operations
     async def import_bulk_gt(
-        self, items: list[GroundTruthItem], buckets: int | None = None
+        self, items: list[AgenticGroundTruthEntry], buckets: int | None = None
     ) -> BulkImportResult: ...
     async def list_gt_by_dataset(
         self, dataset: str, status: Optional[GroundTruthStatus] = None
-    ) -> list[GroundTruthItem]: ...
+    ) -> list[AgenticGroundTruthEntry]: ...
     async def list_all_gt(
         self, status: Optional[GroundTruthStatus] = None
-    ) -> list[GroundTruthItem]: ...
+    ) -> list[AgenticGroundTruthEntry]: ...
     async def list_gt_paginated(
         self,
         status: Optional[GroundTruthStatus] = None,
@@ -38,26 +38,34 @@ async def list_gt_paginated(
         sort_order: SortOrder | None = None,
         page: int = 1,
         limit: int = 25,
-    ) -> tuple[list[GroundTruthItem], PaginationMetadata]: ...
-    async def get_gt(self, dataset: str, bucket: UUID, item_id: str) -> GroundTruthItem | None: ...
-    async def upsert_gt(self, item: GroundTruthItem) -> GroundTruthItem: ...
+    ) -> tuple[list[AgenticGroundTruthEntry], PaginationMetadata]: ...
+    async def get_gt(
+        self, dataset: str, bucket: UUID, item_id: str
+    ) -> AgenticGroundTruthEntry | None: ...
+    async def upsert_gt(self, item: AgenticGroundTruthEntry) -> AgenticGroundTruthEntry: ...
     async def soft_delete_gt(self, dataset: str, bucket: UUID, item_id: str) -> None: ...
     async def delete_dataset(self, dataset: str) -> None: ...
     async def stats(self) -> Stats: ...
     async def list_datasets(self) -> list[str]: ...
 
     # Assignment helpers (ground truth assignment lifecycle)
-    async def list_unassigned(self, limit: int) -> list[GroundTruthItem]: ...
+    async def list_unassigned(self, limit: int) -> list[AgenticGroundTruthEntry]: ...
     async def sample_unassigned(
         self, user_id: str, limit: int, exclude_ids: list[str] | None = None
-    ) -> list[GroundTruthItem]: ...
+    ) -> list[AgenticGroundTruthEntry]: ...
+    async def query_unassigned_by_dataset_prefix(
+        self, dataset_prefix: str, user_id: str, take: int, exclude_ids: list[str] | None = None
+    ) -> list[AgenticGroundTruthEntry]: ...
+    async def query_unassigned_global(
+        self, user_id: str, take: int, exclude_ids: list[str] | None = None
+    ) -> list[AgenticGroundTruthEntry]: ...
     async def assign_to(self, item_id: str, user_id: str) -> bool: ...
     async def clear_assignment(self, item_id: str) -> bool: ...
-    async def list_assigned(self, user_id: str) -> list[GroundTruthItem]: ...
+    async def list_assigned(self, user_id: str) -> list[AgenticGroundTruthEntry]: ...
 
     # SME Assignment documents (secondary container)
     async def upsert_assignment_doc(
-        self, user_id: str, gt: GroundTruthItem
+        self, user_id: str, gt: AgenticGroundTruthEntry
     ) -> AssignmentDocument: ...
     async def list_assignments_by_user(self, user_id: str) -> list[AssignmentDocument]: ...
     async def get_assignment_by_gt(
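Review note for the `cosmos_repo.py` diff that follows: the new `_from_doc` logic merges `historyAnnotations` from the `rag-compat` plugin data back into `history` entries by index. Restated as a standalone function, the merge rule looks roughly like the sketch below. The function name `merge_history_annotations` is hypothetical and does not exist in the repository; the field names `refs` and `expectedBehavior` are taken from the diff.

```python
from typing import Any


def merge_history_annotations(
    history: list[Any], annotations: list[Any]
) -> list[Any]:
    """Copy refs/expectedBehavior from annotations[i] into history[i] when absent."""
    merged: list[Any] = []
    for index, entry in enumerate(history):
        if not isinstance(entry, dict):
            # Non-dict entries pass through untouched.
            merged.append(entry)
            continue
        entry_dict = dict(entry)  # shallow copy so the input entry is not mutated
        annotation = annotations[index] if index < len(annotations) else None
        if isinstance(annotation, dict):
            for key in ("refs", "expectedBehavior"):
                # Values already present on the history entry take precedence.
                if key in annotation and key not in entry_dict:
                    entry_dict[key] = annotation[key]
        merged.append(entry_dict)
    return merged
```

Two properties are worth checking in review: values already on a history entry are never overwritten, and a shorter annotation list leaves trailing history entries unchanged.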
diff --git a/backend/app/adapters/repos/cosmos_repo.py b/backend/app/adapters/repos/cosmos_repo.py
index aac8df5..2dfe5b4 100644
--- a/backend/app/adapters/repos/cosmos_repo.py
+++ b/backend/app/adapters/repos/cosmos_repo.py
@@ -22,10 +22,11 @@
 
 from app.adapters.repos.base import GroundTruthRepo
 from app.domain.models import (
-    GroundTruthItem,
+    AgenticGroundTruthEntry,
     Stats,
     AssignmentDocument,
     DatasetCurationInstructions,
+    BulkImportPersistenceError,
     BulkImportResult,
     PaginationMetadata,
 )
@@ -51,12 +52,15 @@
     ord("\u007f"): " ",
 }
 
-# Cosmos DB SELECT clause for most GroundTruthItem fields used in several functions
+# Cosmos DB SELECT clause for AgenticGroundTruthEntry fields used in several functions
 # list_gt_paginated, _list_gt_paginated_with_emulator, list_gt_by_dataset
+# Note: legacy fields like synthQuestion, editedQuestion are still selected for compatibility
+# during migration, but the model will access them via computed properties
 SELECT_CLAUSE_C = (
     "SELECT c.id, c.datasetName, c.bucket, c.status, c.docType, c.schemaVersion, "
-    "c.curationInstructions, c.synthQuestion, c.editedQuestion, c.answer, c.refs, c.tags, c.manualTags, c.computedTags, c.comment, "
-    "c.history, "
+    "c.synthQuestion, c.editedQuestion, c.answer, c.refs, c.tags, c.manualTags, c.computedTags, c.comment, c.plugins, "
+    "c.scenarioId, c.history, c.contextEntries, c.traceIds, c.toolCalls, c.expectedTools, "
+    "c.feedback, c.metadata, c.createdBy, c.createdAt, c.tracePayload, "
     "c.contextUsedForGeneration, c.contextSource, c.modelUsedForGeneration, "
     "c.semanticClusterNumber, c.weight, c.samplingBucket, c.questionLength, "
     "c.assignedTo, c.assignedAt, c.totalReferences, c.updatedAt, c.updatedBy, c.reviewedAt, c._etag "
@@ -381,20 +385,19 @@ def _sanitize_doc_for_emulator_retry(doc: dict[str, Any]) -> dict[str, Any]:
         except Exception:
             return sanitized
 
-    def _to_doc(self, item: GroundTruthItem) -> dict[str, Any]:
+    def _to_doc(self, item: AgenticGroundTruthEntry) -> dict[str, Any]:
         # Check if the doc has dataset and bucket fields, since they make the PK
         if not item.datasetName:
             self._logger.error(f"Document missing datasetName: {item!r}")
             raise ValueError("Document must have datasetName")
 
-        # The domain model's model_validator computes totalReferences automatically.
-        # Trigger it now if the item was modified after initial validation.
-        if item.totalReferences == 0:
-            # Re-validate to ensure totalReferences is computed
-            item = GroundTruthItem.model_validate(item.model_dump(by_alias=True))
-
         # Dump in JSON mode so datetimes/enums are serialized to strings
         d = item.model_dump(mode="json", by_alias=True)
+
+        # Ensure totalReferences is computed and persisted for sorting/querying
+        # Use the property getter which handles both explicit values and plugin storage
+        d["totalReferences"] = item.totalReferences
+
         if d.get("bucket") is not None:
             d["bucket"] = str(d["bucket"])  # store UUID as string
         # Ensure updatedAt present as ISO string
@@ -404,18 +407,70 @@ def _to_doc(self, item: GroundTruthItem) -> dict[str, Any]:
         return d
 
     @staticmethod
-    def _from_doc(doc: dict[str, Any]) -> GroundTruthItem:
+    def _from_doc(doc: dict[str, Any]) -> AgenticGroundTruthEntry:
         # Normalize doc before validation
         normalized_doc = (
             _restore_unicode_from_cosmos(doc) if settings.COSMOS_DISABLE_UNICODE_ESCAPE else doc
         )
+        from app.plugins.packs.rag_compat import _LEGACY_PLUGIN_FIELDS
+
+        allowed_keys = (
+            set(AgenticGroundTruthEntry.model_fields)
+            | {
+                field.alias
+                for field in AgenticGroundTruthEntry.model_fields.values()
+                if field.alias is not None
+            }
+            | {
+                # Include computed_fields that need to be preserved from Cosmos documents
+                "totalReferences"  # Computed and persisted for sorting/querying
+            }
+            | set(_LEGACY_PLUGIN_FIELDS)
+        )
+        normalized_doc = {
+            key: value for key, value in normalized_doc.items() if key in allowed_keys
+        }
+
+        plugins = normalized_doc.get("plugins")
+        rag_plugin = plugins.get("rag-compat") if isinstance(plugins, dict) else None
+        rag_data = rag_plugin.get("data") if isinstance(rag_plugin, dict) else None
+        history_annotations = (
+            rag_data.get("historyAnnotations") if isinstance(rag_data, dict) else None
+        )
+        history = normalized_doc.get("history")
+        if isinstance(history, list) and isinstance(history_annotations, list):
+            merged_history: list[Any] = []
+            for index, entry in enumerate(history):
+                if isinstance(entry, dict):
+                    entry_dict = dict(entry)
+                    annotation = (
+                        history_annotations[index] if index < len(history_annotations) else None
+                    )
+                    if isinstance(annotation, dict):
+                        if "refs" in annotation and "refs" not in entry_dict:
+                            entry_dict["refs"] = annotation["refs"]
+                        if (
+                            "expectedBehavior" in annotation
+                            and "expectedBehavior" not in entry_dict
+                        ):
+                            entry_dict["expectedBehavior"] = annotation["expectedBehavior"]
+                    merged_history.append(entry_dict)
+                else:
+                    merged_history.append(entry)
+            normalized_doc["history"] = merged_history
 
         # Convert None to [] for history field (legacy data compatibility)
         if normalized_doc.get("history") is None:
             normalized_doc["history"] = []
 
         # Pydantic will parse aliases automatically
-        item = GroundTruthItem.model_validate(normalized_doc)
+        item = AgenticGroundTruthEntry.model_validate(normalized_doc)
+
+        # IMPORTANT: totalReferences is a @computed_field, so Pydantic won't deserialize it
+        # from the document. We need to manually set it in __dict__ so the property getter
+        # can find it. This preserves the value we computed and persisted in _to_doc.
+        if "totalReferences" in normalized_doc:
+            item.__dict__["totalReferences"] = normalized_doc["totalReferences"]
 
         return item
 
@@ -498,7 +553,7 @@ async def _execute_query_with_metrics(
         return items
 
     async def import_bulk_gt(
-        self, items: list[GroundTruthItem], buckets: int | None = None
+        self, items: list[AgenticGroundTruthEntry], buckets: int | None = None
     ) -> BulkImportResult:
         await self._ensure_initialized()
         # Assign UUID buckets per dataset for items missing bucket
@@ -525,7 +580,8 @@ async def import_bulk_gt(
         assert gt is not None
         success = 0
         errors: list[str] = []
-        for it in items:
+        persistence_errors: list[BulkImportPersistenceError] = []
+        for persistence_index, it in enumerate(items):
             doc = self._to_doc(it)
 
             # Apply UTF-8 fix when using Cosmos emulator
@@ -544,8 +600,14 @@ async def import_bulk_gt(
                         if doc.get("refs")
                         else "unknown"
                     )
-                    errors.append(
-                        f"exists (article: {article_num}, id: {doc.get('id', 'unknown')})"
+                    message = f"exists (article: {article_num}, id: {doc.get('id', 'unknown')})"
+                    errors.append(message)
+                    persistence_errors.append(
+                        BulkImportPersistenceError(
+                            message=message,
+                            item_id=doc.get("id", "unknown"),
+                            persistence_index=persistence_index,
+                        )
                     )
                 else:
                     article_num = (
@@ -553,23 +615,37 @@ async def import_bulk_gt(
                         if doc.get("refs")
                         else "unknown"
                     )
-                    errors.append(
-                        f"create_failed (article: {article_num}, id: {doc.get('id', 'unknown')}): {getattr(e, 'message', str(e))}"
+                    message = (
+                        f"create_failed (article: {article_num}, id: {doc.get('id', 'unknown')}): "
+                        f"{getattr(e, 'message', str(e))}"
                     )
-        return BulkImportResult(imported=success, errors=errors)
+                    errors.append(message)
+                    persistence_errors.append(
+                        BulkImportPersistenceError(
+                            message=message,
+                            item_id=doc.get("id", "unknown"),
+                            persistence_index=persistence_index,
+                        )
+                    )
+        return BulkImportResult(
+            imported=success,
+            errors=errors,
+            persistence_errors=persistence_errors,
+        )
 
     async def list_all_gt(
         self, status: Optional[GroundTruthStatus] = None
-    ) -> list[GroundTruthItem]:
+    ) -> list[AgenticGroundTruthEntry]:
         await self._ensure_initialized()
-        # Cross-partition scan for all GT items; filter by status if provided
-        status_filter = ""
+        # Cross-partition scan for all GT items; exclude non-ground-truth documents (e.g. curation-instructions)
+        clauses: list[str] = ["c.docType = 'ground-truth-item'"]
         params: list[dict[str, Any]] = []
         if status is not None:
-            status_filter = " WHERE c.status = @status"
+            clauses.append("c.status = @status")
             params.append({"name": "@status", "value": status.value})
-        query = f"SELECT * FROM c{status_filter}"
-        items: list[GroundTruthItem] = []
+        where = " WHERE " + " AND ".join(clauses)
+        query = f"SELECT * FROM c{where}"
+        items: list[AgenticGroundTruthEntry] = []
         gt = self._gt_container
         assert gt is not None
         it = gt.query_items(query=query, parameters=params, enable_scan_in_query=True)  # type: ignore
@@ -671,7 +747,7 @@ def _resolve_sort(
         return field, direction
 
     @staticmethod
-    def _sort_key(item: GroundTruthItem, field: SortField) -> tuple[Any, ...]:
+    def _sort_key(item: AgenticGroundTruthEntry, field: SortField) -> tuple[Any, ...]:
         if field == SortField.id:
             return (item.id or "",)
 
@@ -709,7 +785,7 @@ def is_cosmos_emulator_in_use(self) -> bool:
         return "localhost" in self._endpoint or "127.0.0.1" in self._endpoint
 
     @staticmethod
-    def _item_matches_keyword(item: GroundTruthItem, keyword: str) -> bool:
+    def _item_matches_keyword(item: AgenticGroundTruthEntry, keyword: str) -> bool:
         """Check if item matches keyword search (case-insensitive substring match).
 
         Searches across:
@@ -790,7 +866,7 @@ async def list_gt_paginated(
         sort_order: SortOrder | None = None,
         page: int = 1,
         limit: int = 25,
-    ) -> tuple[list[GroundTruthItem], PaginationMetadata]:
+    ) -> tuple[list[AgenticGroundTruthEntry], PaginationMetadata]:
         await self._ensure_initialized()
 
         safe_limit = max(settings.PAGINATION_MIN_LIMIT, min(limit, settings.PAGINATION_MAX_LIMIT))
@@ -882,7 +958,7 @@ async def list_gt_paginated(
             enable_scan_in_query=True,
         )
 
-        items = [self._from_doc(doc) for doc in docs]
+        items: list[AgenticGroundTruthEntry] = [self._from_doc(doc) for doc in docs]
 
         # Get total count for pagination metadata
         total = await self._get_filtered_count(status, dataset, normalized_tags, item_id)
@@ -912,7 +988,7 @@ async def _list_gt_paginated_with_emulator(
         sort_order: SortOrder | None,
         page: int,
         limit: int,
-    ) -> tuple[list[GroundTruthItem], PaginationMetadata]:
+    ) -> tuple[list[AgenticGroundTruthEntry], PaginationMetadata]:
         """Handle pagination for queries with tag and url_ref filters (requires in-memory filtering).
 
         Note: Due to Cosmos DB limitations with ARRAY_CONTAINS + ORDER BY, tag filtering
@@ -947,7 +1023,7 @@ async def _list_gt_paginated_with_emulator(
         # Memory safeguard: limit maximum items to fetch to prevent DoS
         MAX_ITEMS_TO_FETCH = settings.PAGINATION_TAG_FETCH_MAX
 
-        raw_items: list[GroundTruthItem] = []
+        raw_items: list[AgenticGroundTruthEntry] = []
         it = gt.query_items(  # type: ignore
             query=query,
             parameters=filter_params,
@@ -978,7 +1054,7 @@ async def _list_gt_paginated_with_emulator(
             "Filtering tags and ref_url in-memory due to Cosmos DB emulator limitations"
         )
         if tags:
-            filtered_items_tag: list[GroundTruthItem] = []
+            filtered_items_tag: list[AgenticGroundTruthEntry] = []
             tags_set = set(tags)
             for item in raw_items:
                 if item.tags and tags_set.issubset(set(item.tags)):
@@ -987,7 +1063,7 @@ async def _list_gt_paginated_with_emulator(
 
         # Filter excluded tags in-memory (items with ANY excluded tag are filtered out)
         if exclude_tags:
-            filtered_items_exclude: list[GroundTruthItem] = []
+            filtered_items_exclude: list[AgenticGroundTruthEntry] = []
             exclude_tags_set = set(exclude_tags)
             for item in raw_items:
                 # Keep item only if it has NO tags from the exclude list
@@ -998,7 +1074,7 @@ async def _list_gt_paginated_with_emulator(
         # Filter by ref_url in-memory (EXISTS not supported by Cosmos DB emulator)
         if ref_url:
             start = time.time()
-            filtered_items_ref: list[GroundTruthItem] = []
+            filtered_items_ref: list[AgenticGroundTruthEntry] = []
             total_refs_checked = 0
 
             for item in raw_items:
@@ -1009,9 +1085,10 @@ async def _list_gt_paginated_with_emulator(
                 # Check history-level refs if no match yet
                 if not has_match and item.history:
                     for turn in item.history:
-                        if turn.refs:
-                            total_refs_checked += len(turn.refs)
-                            if any(ref_url in ref.url for ref in turn.refs):
+                        turn_refs = getattr(turn, "refs", None)
+                        if turn_refs:
+                            total_refs_checked += len(turn_refs)
+                            if any(ref_url in ref.url for ref in turn_refs):
                                 has_match = True
                                 break
                 if has_match:
@@ -1031,7 +1108,7 @@ async def _list_gt_paginated_with_emulator(
         # Filter by keyword in-memory (case-insensitive substring match)
         if keyword:
             start = time.time()
-            filtered_items_keyword: list[GroundTruthItem] = []
+            filtered_items_keyword: list[AgenticGroundTruthEntry] = []
 
             for item in raw_items:
                 if self._item_matches_keyword(item, keyword):
@@ -1184,7 +1261,7 @@ async def _get_filtered_count(
 
     async def list_gt_by_dataset(
         self, dataset: str, status: Optional[GroundTruthStatus] = None
-    ) -> list[GroundTruthItem]:
+    ) -> list[AgenticGroundTruthEntry]:
         await self._ensure_initialized()
         # Query across all buckets for this dataset by filtering datasetName
         status_filter = ""
@@ -1198,7 +1275,7 @@ async def list_gt_by_dataset(
             "FROM c WHERE c.datasetName = @ds AND (NOT IS_DEFINED(c.docType) OR c.docType != 'curation-instructions')"
             + status_filter
         )
-        items: list[GroundTruthItem] = []
+        items: list[AgenticGroundTruthEntry] = []
         gt = self._gt_container
         assert gt is not None
         it = gt.query_items(query=query, parameters=params, enable_scan_in_query=True)  # type: ignore
@@ -1207,7 +1284,9 @@ async def list_gt_by_dataset(
 
         return items
 
-    async def get_gt(self, dataset: str, bucket: UUID, item_id: str) -> GroundTruthItem | None:
+    async def get_gt(
+        self, dataset: str, bucket: UUID, item_id: str
+    ) -> AgenticGroundTruthEntry | None:
         await self._ensure_initialized()
         # dataset and bucket comprise the hierarchical partition key
         try:
@@ -1290,7 +1369,7 @@ async def upsert_curation_instructions(
                 raise ValueError("etag_mismatch")
             raise
 
-    async def upsert_gt(self, item: GroundTruthItem) -> GroundTruthItem:
+    async def upsert_gt(self, item: AgenticGroundTruthEntry) -> AgenticGroundTruthEntry:
         await self._ensure_initialized()
 
         doc = self._to_doc(item)
@@ -1550,7 +1629,7 @@ async def list_datasets(self) -> list[str]:
             names.append(name)
         return names
 
-    async def list_unassigned(self, limit: int) -> list[GroundTruthItem]:
+    async def list_unassigned(self, limit: int) -> list[AgenticGroundTruthEntry]:
         await self._ensure_initialized()
         if limit <= 0:
             return []
@@ -1570,7 +1649,7 @@ async def list_unassigned(self, limit: int) -> list[GroundTruthItem]:
             enable_scan_in_query=True,
             max_item_count=min(limit, 200),
         )
-        res: list[GroundTruthItem] = []
+        res: list[AgenticGroundTruthEntry] = []
         async for doc in it:  # type: ignore
             res.append(self._from_doc(doc))
             if len(res) >= limit:
@@ -1584,7 +1663,7 @@ async def list_unassigned(self, limit: int) -> list[GroundTruthItem]:
     # we need to add these newly assigned docs into the assignments container with the relevant subset of fields and a link back to the ground truth item.
     async def sample_unassigned(
         self, user_id: str, limit: int, exclude_ids: list[str] | None = None
-    ) -> list[GroundTruthItem]:
+    ) -> list[AgenticGroundTruthEntry]:
         await self._ensure_initialized()
         if limit <= 0:
             self._logger.warning(
@@ -1601,7 +1680,7 @@ async def sample_unassigned(
                 "exclude_count": len(exclude_ids) if exclude_ids else 0,
             },
         )
-        results: list[GroundTruthItem] = await self.list_assigned(user_id)
+        results: list[AgenticGroundTruthEntry] = await self.list_assigned(user_id)
         seen_ids: set[str] = {it.id for it in results}
         # Add caller-provided excludes
         if exclude_ids:
@@ -1659,7 +1738,7 @@ async def sample_unassigned(
         )
 
         # 4) Query each dataset up to its quota (single pass)
-        per_dataset_results: dict[str, list[GroundTruthItem]] = {}
+        per_dataset_results: dict[str, list[AgenticGroundTruthEntry]] = {}
         for ds, q in quotas.items():
             if q <= 0:
                 self._logger.debug(
@@ -1744,7 +1823,7 @@ async def sample_unassigned(
 
     async def query_unassigned_by_dataset_prefix(
         self, dataset_prefix: str, user_id: str, take: int, exclude_ids: list[str] | None = None
-    ) -> list[GroundTruthItem]:
+    ) -> list[AgenticGroundTruthEntry]:
         if take <= 0:
             return []
         await self._ensure_initialized()
@@ -1788,7 +1867,7 @@ async def query_unassigned_by_dataset_prefix(
             enable_scan_in_query=True,
             max_item_count=min(take, 200),
         )
-        res: list[GroundTruthItem] = []
+        res: list[AgenticGroundTruthEntry] = []
         async for doc in it:  # type: ignore
             res.append(self._from_doc(doc))
             if len(res) >= take:
@@ -1805,7 +1884,7 @@ async def query_unassigned_by_dataset_prefix(
 
     async def query_unassigned_global(
         self, user_id: str, take: int, exclude_ids: list[str] | None = None
-    ) -> list[GroundTruthItem]:
+    ) -> list[AgenticGroundTruthEntry]:
         if take <= 0:
             return []
         await self._ensure_initialized()
@@ -1843,7 +1922,7 @@ async def query_unassigned_global(
             enable_scan_in_query=True,
             max_item_count=min(take, 200),
         )
-        res: list[GroundTruthItem] = []
+        res: list[AgenticGroundTruthEntry] = []
         async for doc in it:  # type: ignore
             res.append(self._from_doc(doc))
             if len(res) >= take:
@@ -2189,7 +2268,7 @@ async def _clear_assignment_with_read_modify_replace(self, item_id: str) -> bool
             )
             return False
 
-    async def list_assigned(self, user_id: str) -> list[GroundTruthItem]:
+    async def list_assigned(self, user_id: str) -> list[AgenticGroundTruthEntry]:
         await self._ensure_initialized()
         query = "SELECT * FROM c WHERE c.assignedTo = @u AND c.status = 'draft'"
         gt = self._gt_container
@@ -2199,14 +2278,16 @@ async def list_assigned(self, user_id: str) -> list[GroundTruthItem]:
             parameters=[{"name": "@u", "value": user_id}],
             enable_scan_in_query=True,
         )  # type: ignore
-        items: list[GroundTruthItem] = []
+        items: list[AgenticGroundTruthEntry] = []
         async for doc in it:  # type: ignore
             items.append(self._from_doc(doc))
         self._logger.debug("repo.list_assigned", extra={"count": len(items)})
         return items
 
     # Assignment documents APIs (assignments container)
-    async def upsert_assignment_doc(self, user_id: str, gt: GroundTruthItem) -> AssignmentDocument:
+    async def upsert_assignment_doc(
+        self, user_id: str, gt: AgenticGroundTruthEntry
+    ) -> AssignmentDocument:
         await self._ensure_initialized()
         doc_id = f"{gt.datasetName}|{str(gt.bucket)}|{gt.id}"
         ad = AssignmentDocument(
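Review note on the `list_all_gt` hunk in this file: the query is now assembled from a list of clauses joined with `AND`, with the `docType` predicate always present and `status` added only when supplied. A minimal standalone sketch of that pattern follows, assuming a hypothetical helper name `build_list_all_query` and a plain string status in place of the `GroundTruthStatus` enum; it is illustrative only and not the repository's code.

```python
from typing import Any, Optional


def build_list_all_query(status: Optional[str] = None) -> tuple[str, list[dict[str, Any]]]:
    """Build a parameterized query string plus its parameter list (hypothetical helper)."""
    # The docType predicate is always present, so the WHERE clause is never empty.
    clauses: list[str] = ["c.docType = 'ground-truth-item'"]
    params: list[dict[str, Any]] = []
    if status is not None:
        # The value travels as a named parameter rather than being spliced into the SQL text.
        clauses.append("c.status = @status")
        params.append({"name": "@status", "value": status})
    query = "SELECT * FROM c WHERE " + " AND ".join(clauses)
    return query, params


query, params = build_list_all_query("approved")
print(query)
print(params)
```

Because the first clause is unconditional, adding further optional filters only needs another `clauses.append(...)` and a matching parameter entry.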
diff --git a/backend/app/adapters/repos/memory_repo.py b/backend/app/adapters/repos/memory_repo.py
new file mode 100644
index 0000000..84d278a
--- /dev/null
+++ b/backend/app/adapters/repos/memory_repo.py
@@ -0,0 +1,453 @@
+from __future__ import annotations
+
+from datetime import datetime, timezone
+from math import ceil
+from typing import Iterable
+from uuid import UUID
+
+from app.domain.enums import GroundTruthStatus, SortField, SortOrder
+from app.domain.models import (
+    AgenticGroundTruthEntry,
+    AssignmentDocument,
+    BulkImportPersistenceError,
+    BulkImportResult,
+    DatasetCurationInstructions,
+    PaginationMetadata,
+    Stats,
+)
+
+ZERO_UUID = UUID("00000000-0000-0000-0000-000000000000")
+
+
+class InMemoryGroundTruthRepo:
+    def __init__(
+        self,
+        *,
+        items: list[AgenticGroundTruthEntry] | None = None,
+        curation_instructions: list[DatasetCurationInstructions] | None = None,
+    ) -> None:
+        self.items: dict[str, AgenticGroundTruthEntry] = {}
+        self._locations: dict[tuple[str, UUID, str], str] = {}
+        self._assignment_docs: dict[tuple[str, str], AssignmentDocument] = {}
+        self._curation: dict[str, DatasetCurationInstructions] = {}
+        self._etag_version = 0
+
+        for item in items or []:
+            self._store_initial_item(item)
+        for doc in curation_instructions or []:
+            self._curation[doc.datasetName] = self._clone_instruction(doc)
+
+    def _now(self) -> datetime:
+        return datetime.now(timezone.utc)
+
+    def _next_etag(self) -> str:
+        self._etag_version += 1
+        return f"memory-etag-{self._etag_version}"
+
+    def _clone_item(self, item: AgenticGroundTruthEntry) -> AgenticGroundTruthEntry:
+        return AgenticGroundTruthEntry.model_validate(item.model_dump(by_alias=True))
+
+    def _clone_instruction(self, doc: DatasetCurationInstructions) -> DatasetCurationInstructions:
+        return DatasetCurationInstructions.model_validate(doc.model_dump(by_alias=True))
+
+    def _location_key(
+        self, dataset: str, bucket: UUID | None, item_id: str
+    ) -> tuple[str, UUID, str]:
+        return (dataset, bucket or ZERO_UUID, item_id)
+
+    def _store_initial_item(self, item: AgenticGroundTruthEntry) -> None:
+        stored = self._clone_item(item)
+        now = stored.updated_at or self._now()
+        if stored.created_at is None:
+            stored.created_at = now
+        stored.updated_at = now
+        stored.etag = self._next_etag()
+        if stored.bucket is None:
+            stored.bucket = ZERO_UUID
+        self.items[stored.id] = stored
+        self._locations[self._location_key(stored.datasetName, stored.bucket, stored.id)] = (
+            stored.id
+        )
+        if stored.assignedTo:
+            self._assignment_docs[(stored.assignedTo, stored.id)] = AssignmentDocument(
+                id=f"{stored.assignedTo}:{stored.id}",
+                pk=stored.assignedTo,
+                ground_truth_id=stored.id,
+                datasetName=stored.datasetName,
+                bucket=stored.bucket,
+            )
+
+    def _save_item(self, item: AgenticGroundTruthEntry) -> AgenticGroundTruthEntry:
+        stored = self._clone_item(item)
+        if stored.bucket is None:
+            stored.bucket = ZERO_UUID
+        if stored.created_at is None:
+            stored.created_at = self._now()
+        stored.updated_at = self._now()
+        stored.etag = self._next_etag()
+        self.items[stored.id] = stored
+        self._locations[self._location_key(stored.datasetName, stored.bucket, stored.id)] = (
+            stored.id
+        )
+        return self._clone_item(stored)
+
+    def _get_stored(self, item_id: str) -> AgenticGroundTruthEntry | None:
+        return self.items.get(item_id)
+
+    def _matches_location(
+        self, item: AgenticGroundTruthEntry, dataset: str, bucket: UUID, item_id: str
+    ) -> bool:
+        return (
+            item.id == item_id
+            and item.datasetName == dataset
+            and (item.bucket or ZERO_UUID) == bucket
+        )
+
+    def _collect_urls(self, item: AgenticGroundTruthEntry) -> Iterable[str]:
+        for ref in item.refs:
+            yield ref.url
+        for turn in item.history or []:
+            for ref in getattr(turn, "refs", None) or []:
+                yield ref.url
+
+    def _collect_text(self, item: AgenticGroundTruthEntry) -> str:
+        parts = [
+            item.id,
+            item.datasetName,
+            item.synth_question or "",
+            item.edited_question or "",
+            item.answer or "",
+            item.comment or "",
+        ]
+        for turn in item.history or []:
+            parts.append(turn.msg)
+        for ref in item.refs:
+            parts.extend([ref.title or "", ref.url, ref.content or "", ref.keyExcerpt or ""])
+        for turn in item.history or []:
+            for ref in getattr(turn, "refs", None) or []:
+                parts.extend([ref.title or "", ref.url, ref.content or "", ref.keyExcerpt or ""])
+        return " ".join(parts).lower()
+
+    def _is_unassigned_candidate(self, item: AgenticGroundTruthEntry) -> bool:
+        return not item.assignedTo and item.status in {
+            GroundTruthStatus.draft,
+            GroundTruthStatus.skipped,
+        }
+
+    def _sort_items(
+        self,
+        items: list[AgenticGroundTruthEntry],
+        sort_by: SortField | None,
+        sort_order: SortOrder | None,
+    ) -> list[AgenticGroundTruthEntry]:
+        field = sort_by or SortField.reviewed_at
+        reverse = (sort_order or SortOrder.desc) == SortOrder.desc
+
+        def key(item: AgenticGroundTruthEntry):
+            if field == SortField.updated_at:
+                return item.updated_at or datetime.min.replace(tzinfo=timezone.utc)
+            if field == SortField.id:
+                return item.id
+            if field == SortField.has_answer:
+                return (
+                    1 if (item.answer or "").strip() else 0,
+                    item.updated_at or datetime.min.replace(tzinfo=timezone.utc),
+                )
+            if field == SortField.totalReferences:
+                return (
+                    item.totalReferences,
+                    item.updated_at or datetime.min.replace(tzinfo=timezone.utc),
+                )
+            if field == SortField.tag_count:
+                return (
+                    len(item.tags),
+                    item.updated_at or datetime.min.replace(tzinfo=timezone.utc),
+                )
+            return item.reviewed_at or datetime.min.replace(tzinfo=timezone.utc)
+
+        return sorted(items, key=key, reverse=reverse)
+
+    async def import_bulk_gt(
+        self, items: list[AgenticGroundTruthEntry], buckets: int | None = None
+    ) -> BulkImportResult:
+        imported = 0
+        persistence_errors: list[BulkImportPersistenceError] = []
+        for index, item in enumerate(items):
+            if item.id in self.items:
+                persistence_errors.append(
+                    BulkImportPersistenceError(
+                        message=f"duplicate_id (id: {item.id})",
+                        item_id=item.id,
+                        persistence_index=index,
+                    )
+                )
+                continue
+            self._save_item(item)
+            imported += 1
+        return BulkImportResult(imported=imported, persistence_errors=persistence_errors)
+
+    async def list_gt_by_dataset(
+        self, dataset: str, status: GroundTruthStatus | None = None
+    ) -> list[AgenticGroundTruthEntry]:
+        items = [item for item in self.items.values() if item.datasetName == dataset]
+        if status is not None:
+            items = [item for item in items if item.status == status]
+        return [
+            self._clone_item(item)
+            for item in self._sort_items(items, SortField.updated_at, SortOrder.desc)
+        ]
+
+    async def list_all_gt(
+        self, status: GroundTruthStatus | None = None
+    ) -> list[AgenticGroundTruthEntry]:
+        items = list(self.items.values())
+        if status is not None:
+            items = [item for item in items if item.status == status]
+        return [
+            self._clone_item(item)
+            for item in self._sort_items(items, SortField.updated_at, SortOrder.desc)
+        ]
+
+    async def list_gt_paginated(
+        self,
+        status: GroundTruthStatus | None = None,
+        dataset: str | None = None,
+        tags: list[str] | None = None,
+        exclude_tags: list[str] | None = None,
+        item_id: str | None = None,
+        ref_url: str | None = None,
+        keyword: str | None = None,
+        sort_by: SortField | None = None,
+        sort_order: SortOrder | None = None,
+        page: int = 1,
+        limit: int = 25,
+    ) -> tuple[list[AgenticGroundTruthEntry], PaginationMetadata]:
+        filtered = list(self.items.values())
+        if status is not None:
+            filtered = [item for item in filtered if item.status == status]
+        if dataset:
+            filtered = [item for item in filtered if item.datasetName == dataset]
+        if tags:
+            required = set(tags)
+            filtered = [item for item in filtered if required.issubset(set(item.tags))]
+        if exclude_tags:
+            banned = set(exclude_tags)
+            filtered = [item for item in filtered if not banned.intersection(set(item.tags))]
+        if item_id:
+            filtered = [item for item in filtered if item_id in item.id]
+        if ref_url:
+            filtered = [
+                item for item in filtered if any(ref_url in url for url in self._collect_urls(item))
+            ]
+        if keyword:
+            lowered = keyword.lower()
+            filtered = [item for item in filtered if lowered in self._collect_text(item)]
+
+        sorted_items = self._sort_items(filtered, sort_by, sort_order)
+        total = len(sorted_items)
+        start = (page - 1) * limit
+        end = start + limit
+        page_items = [self._clone_item(item) for item in sorted_items[start:end]]
+        total_pages = ceil(total / limit) if total and limit > 0 else 0
+        metadata = PaginationMetadata(
+            page=page,
+            limit=limit,
+            total=total,
+            totalPages=total_pages,
+            hasNext=end < total,
+            hasPrev=page > 1 and total > 0,
+        )
+        return page_items, metadata
+
+    async def get_gt(
+        self, dataset: str, bucket: UUID, item_id: str
+    ) -> AgenticGroundTruthEntry | None:
+        item = self._get_stored(item_id)
+        if item is None or not self._matches_location(item, dataset, bucket, item_id):
+            return None
+        return self._clone_item(item)
+
+    async def upsert_gt(self, item: AgenticGroundTruthEntry) -> AgenticGroundTruthEntry:
+        existing = self._get_stored(item.id)
+        if existing is not None and item.etag and existing.etag != item.etag:
+            raise ValueError("etag_mismatch")
+        candidate = self._clone_item(item)
+        if existing is not None and candidate.created_at is None:
+            candidate.created_at = existing.created_at
+        saved = self._save_item(candidate)
+        if saved.assignedTo:
+            await self.upsert_assignment_doc(saved.assignedTo, saved)
+        else:
+            stale_docs = [key for key in self._assignment_docs if key[1] == saved.id]
+            for key in stale_docs:
+                self._assignment_docs.pop(key, None)
+        return saved
+
+    async def soft_delete_gt(self, dataset: str, bucket: UUID, item_id: str) -> None:
+        existing = self._get_stored(item_id)
+        if existing is None or not self._matches_location(existing, dataset, bucket, item_id):
+            return
+        existing.status = GroundTruthStatus.deleted
+        existing.assignedTo = None
+        existing.assigned_at = None
+        self._save_item(existing)
+        stale_docs = [key for key in self._assignment_docs if key[1] == item_id]
+        for key in stale_docs:
+            self._assignment_docs.pop(key, None)
+
+    async def delete_dataset(self, dataset: str) -> None:
+        ids = [item.id for item in self.items.values() if item.datasetName == dataset]
+        for item_id in ids:
+            self.items.pop(item_id, None)
+            stale_docs = [key for key in self._assignment_docs if key[1] == item_id]
+            for key in stale_docs:
+                self._assignment_docs.pop(key, None)
+        self._curation.pop(dataset, None)
+        self._locations = {
+            key: value for key, value in self._locations.items() if key[0] != dataset
+        }
+
+    async def stats(self) -> Stats:
+        stats = Stats()
+        for item in self.items.values():
+            if item.status == GroundTruthStatus.draft:
+                stats.draft += 1
+            elif item.status == GroundTruthStatus.approved:
+                stats.approved += 1
+            elif item.status == GroundTruthStatus.deleted:
+                stats.deleted += 1
+        return stats
+
+    async def list_datasets(self) -> list[str]:
+        return sorted({item.datasetName for item in self.items.values()})
+
+    async def list_unassigned(self, limit: int) -> list[AgenticGroundTruthEntry]:
+        items = [item for item in self.items.values() if self._is_unassigned_candidate(item)]
+        return [
+            self._clone_item(item)
+            for item in self._sort_items(items, SortField.updated_at, SortOrder.desc)[:limit]
+        ]
+
+    async def sample_unassigned(
+        self, user_id: str, limit: int, exclude_ids: list[str] | None = None
+    ) -> list[AgenticGroundTruthEntry]:
+        return await self.query_unassigned_global(user_id, limit, exclude_ids)
+
+    async def query_unassigned_by_dataset_prefix(
+        self, dataset_prefix: str, user_id: str, take: int, exclude_ids: list[str] | None = None
+    ) -> list[AgenticGroundTruthEntry]:
+        blocked = set(exclude_ids or [])
+        items = [
+            item
+            for item in self.items.values()
+            if item.datasetName.startswith(dataset_prefix)
+            and self._is_unassigned_candidate(item)
+            and item.id not in blocked
+        ]
+        return [
+            self._clone_item(item)
+            for item in self._sort_items(items, SortField.updated_at, SortOrder.desc)[:take]
+        ]
+
+    async def query_unassigned_global(
+        self, user_id: str, take: int, exclude_ids: list[str] | None = None
+    ) -> list[AgenticGroundTruthEntry]:
+        blocked = set(exclude_ids or [])
+        items = [
+            item
+            for item in self.items.values()
+            if self._is_unassigned_candidate(item) and item.id not in blocked
+        ]
+        return [
+            self._clone_item(item)
+            for item in self._sort_items(items, SortField.updated_at, SortOrder.desc)[:take]
+        ]
+
+    async def assign_to(self, item_id: str, user_id: str) -> bool:
+        existing = self._get_stored(item_id)
+        if existing is None:
+            return False
+        if (
+            existing.assignedTo
+            and existing.assignedTo != user_id
+            and existing.status == GroundTruthStatus.draft
+        ):
+            return False
+        existing.assignedTo = user_id
+        existing.assigned_at = self._now()
+        existing.status = GroundTruthStatus.draft
+        self._save_item(existing)
+        return True
+
+    async def clear_assignment(self, item_id: str) -> bool:
+        existing = self._get_stored(item_id)
+        if existing is None:
+            return False
+        existing.assignedTo = None
+        existing.assigned_at = None
+        self._save_item(existing)
+        stale_docs = [key for key in self._assignment_docs if key[1] == item_id]
+        for key in stale_docs:
+            self._assignment_docs.pop(key, None)
+        return True
+
+    async def list_assigned(self, user_id: str) -> list[AgenticGroundTruthEntry]:
+        items = [
+            item
+            for item in self.items.values()
+            if item.assignedTo == user_id and item.status == GroundTruthStatus.draft
+        ]
+        return [
+            self._clone_item(item)
+            for item in self._sort_items(items, SortField.updated_at, SortOrder.desc)
+        ]
+
+    async def upsert_assignment_doc(
+        self, user_id: str, gt: AgenticGroundTruthEntry
+    ) -> AssignmentDocument:
+        bucket = gt.bucket or ZERO_UUID
+        doc = AssignmentDocument(
+            id=f"{user_id}:{gt.id}",
+            pk=user_id,
+            ground_truth_id=gt.id,
+            datasetName=gt.datasetName,
+            bucket=bucket,
+        )
+        self._assignment_docs[(user_id, gt.id)] = doc
+        return AssignmentDocument.model_validate(doc.model_dump(by_alias=True))
+
+    async def list_assignments_by_user(self, user_id: str) -> list[AssignmentDocument]:
+        docs = [
+            doc
+            for (assigned_user, _), doc in self._assignment_docs.items()
+            if assigned_user == user_id
+        ]
+        return [AssignmentDocument.model_validate(doc.model_dump(by_alias=True)) for doc in docs]
+
+    async def get_assignment_by_gt(
+        self, user_id: str, ground_truth_id: str
+    ) -> AssignmentDocument | None:
+        doc = self._assignment_docs.get((user_id, ground_truth_id))
+        if doc is None:
+            return None
+        return AssignmentDocument.model_validate(doc.model_dump(by_alias=True))
+
+    async def delete_assignment_doc(
+        self, user_id: str, dataset: str, bucket: UUID, ground_truth_id: str
+    ) -> bool:
+        return self._assignment_docs.pop((user_id, ground_truth_id), None) is not None
+
+    async def get_curation_instructions(self, dataset: str) -> DatasetCurationInstructions | None:
+        doc = self._curation.get(dataset)
+        if doc is None:
+            return None
+        return self._clone_instruction(doc)
+
+    async def upsert_curation_instructions(
+        self, doc: DatasetCurationInstructions
+    ) -> DatasetCurationInstructions:
+        stored = self._clone_instruction(doc)
+        stored.updated_at = self._now()
+        stored.etag = self._next_etag()
+        self._curation[stored.datasetName] = stored
+        return self._clone_instruction(stored)
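Review note on `memory_repo.py`: the in-memory repo leans on three ideas: it stores and returns clones so callers never share state with the store, it rejects an upsert with `etag_mismatch` when a supplied etag is stale, and it derives pagination metadata from slice bounds. The sketch below illustrates them under stated assumptions: `Entry` and `MiniRepo` are hypothetical stand-ins, `copy.deepcopy` replaces the Pydantic `model_dump`/`model_validate` round trip, and the `limit`/`page` clamping is an addition of this sketch that the diff itself does not perform.

```python
import copy
from dataclasses import dataclass
from math import ceil
from typing import Optional


@dataclass
class Entry:  # hypothetical stand-in for AgenticGroundTruthEntry
    id: str
    answer: str = ""
    etag: Optional[str] = None


class MiniRepo:  # hypothetical, much reduced version of the in-memory repo
    def __init__(self) -> None:
        self._items: dict[str, Entry] = {}
        self._version = 0

    def upsert(self, item: Entry) -> Entry:
        existing = self._items.get(item.id)
        # Reject the write only when the caller supplied an etag and it is stale.
        if existing is not None and item.etag and existing.etag != item.etag:
            raise ValueError("etag_mismatch")
        stored = copy.deepcopy(item)  # store a copy so later caller mutations do not leak in
        self._version += 1
        stored.etag = f"memory-etag-{self._version}"
        self._items[stored.id] = stored
        return copy.deepcopy(stored)  # return a copy so callers cannot mutate stored state

    def page(self, page: int, limit: int) -> tuple[list[Entry], int, bool]:
        # Clamping is an assumption of this sketch, not something the diff does.
        limit = max(1, limit)
        page = max(1, page)
        ordered = sorted(self._items.values(), key=lambda e: e.id)
        total = len(ordered)
        start = (page - 1) * limit
        end = start + limit
        total_pages = ceil(total / limit) if total else 0
        has_next = end < total
        return [copy.deepcopy(e) for e in ordered[start:end]], total_pages, has_next


repo = MiniRepo()
first = repo.upsert(Entry(id="a", answer="v1"))
repo.upsert(Entry(id="a", answer="v2", etag=first.etag))  # current etag: accepted
try:
    repo.upsert(Entry(id="a", answer="v3", etag=first.etag))  # stale etag: rejected
except ValueError as exc:
    print(exc)  # etag_mismatch
```

As in the diff, an upsert without an etag is accepted unconditionally; only a supplied, non-matching etag is treated as a conflict.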
diff --git a/backend/app/adapters/repos/tag_definitions_repo.py b/backend/app/adapters/repos/tag_definitions_repo.py
index 4c2ef02..fabbb08 100644
--- a/backend/app/adapters/repos/tag_definitions_repo.py
+++ b/backend/app/adapters/repos/tag_definitions_repo.py
@@ -135,9 +135,7 @@ async def list_all(self) -> list[TagDefinition]:
         query = "SELECT * FROM c WHERE c.docType = 'tag-definition'"
         items = []
 
-        async for item in self._container.query_items(
-            query=query, enable_scan_in_query=True
-        ):  # type: ignore
+        async for item in self._container.query_items(query=query, enable_scan_in_query=True):  # type: ignore
             try:
                 items.append(TagDefinition.model_validate(item))
             except Exception:
diff --git a/backend/app/adapters/search/demo_search.py b/backend/app/adapters/search/demo_search.py
new file mode 100644
index 0000000..046cfda
--- /dev/null
+++ b/backend/app/adapters/search/demo_search.py
@@ -0,0 +1,46 @@
+from __future__ import annotations
+
+from app.domain.models import AgenticGroundTruthEntry
+
+
+class DemoSearchAdapter:
+    def __init__(self, items: list[AgenticGroundTruthEntry]) -> None:
+        self._items = items
+
+    async def query(self, q: str, top: int = 5) -> list[dict[str, object]]:
+        query = q.strip().lower()
+        if not query:
+            return []
+
+        matches: list[dict[str, object]] = []
+        seen_urls: set[str] = set()
+        for item in self._items:
+            refs = list(item.refs)
+            for turn in item.history or []:
+                refs.extend(getattr(turn, "refs", None) or [])
+            for ref in refs:
+                haystack = " ".join(
+                    [
+                        ref.url,
+                        ref.title or "",
+                        ref.content or "",
+                        ref.keyExcerpt or "",
+                        item.datasetName,
+                        item.id,
+                    ]
+                ).lower()
+                if query not in haystack:
+                    continue
+                if ref.url in seen_urls:
+                    continue
+                seen_urls.add(ref.url)
+                matches.append(
+                    {
+                        "url": ref.url,
+                        "title": ref.title,
+                        "chunk": ref.content or ref.keyExcerpt or f"Reference for {item.id}",
+                    }
+                )
+                if len(matches) >= top:
+                    return matches
+        return matches
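Review note on `demo_search.py`: the adapter does a case-insensitive substring match over each reference, de-duplicates by URL, and stops once `top` results are collected. A reduced sketch of that flow is below, assuming a hypothetical `Ref` dataclass and a flat list of references instead of the item and history traversal in the diff; the function name `search_refs` is also hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Ref:  # hypothetical stand-in for the reference model
    url: str
    title: Optional[str] = None
    content: Optional[str] = None


def search_refs(refs: list[Ref], q: str, top: int = 5) -> list[dict[str, object]]:
    query = q.strip().lower()
    if not query:
        # Blank queries return nothing rather than matching every reference.
        return []
    matches: list[dict[str, object]] = []
    seen_urls: set[str] = set()
    for ref in refs:
        haystack = " ".join([ref.url, ref.title or "", ref.content or ""]).lower()
        if query not in haystack or ref.url in seen_urls:
            continue
        seen_urls.add(ref.url)  # keep only the first hit per URL
        matches.append({"url": ref.url, "title": ref.title, "chunk": ref.content or ""})
        if len(matches) >= top:
            break
    return matches


refs = [
    Ref("https://example.com/a", "Alpha", "cosmos indexing"),
    Ref("https://example.com/a", "Alpha duplicate", "cosmos indexing"),
    Ref("https://example.com/b", "Beta", "Cosmos DB emulator"),
]
print([m["url"] for m in search_refs(refs, " COSMOS ")])
```

De-duplicating on URL means the first matching reference for a given URL determines the returned title and chunk.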
diff --git a/backend/app/adapters/trace_export.py b/backend/app/adapters/trace_export.py
new file mode 100644
index 0000000..5dba316
--- /dev/null
+++ b/backend/app/adapters/trace_export.py
@@ -0,0 +1,7 @@
+"""Backward-compatibility shim — the canonical location is now
+``app.plugins.adapters.trace_export``.
+"""
+
+from app.plugins.adapters.trace_export import TraceExportAdapter
+
+__all__ = ["TraceExportAdapter"]
diff --git a/backend/app/api/v1/assignments.py b/backend/app/api/v1/assignments.py
index c3056c0..f5382cd 100644
--- a/backend/app/api/v1/assignments.py
+++ b/backend/app/api/v1/assignments.py
@@ -6,15 +6,34 @@
 from uuid import UUID
 
 from pydantic import BaseModel, Field, ConfigDict
+from pydantic.json_schema import SkipJsonSchema
 import logging
 
 from app.core.auth import get_current_user, UserContext
 from app.core.errors import AssignmentConflictError
-from app.domain.models import GroundTruthItem, Reference, AssignmentDocument, HistoryItem
+from app.domain.models import (
+    AgenticGroundTruthEntry,
+    AssignmentDocument,
+    ContextEntry,
+    ExpectedTools,
+    FeedbackEntry,
+    PluginPayload,
+    ToolCallRecord,
+)
 from app.domain.enums import GroundTruthStatus
 from app.container import container
-from datetime import datetime, timezone
-from app.services.tagging_service import apply_computed_tags
+from app.services.ground_truth_update_service import (
+    ETagMismatchError,
+    ETagRequiredError,
+    apply_shared_update,
+    persist_shared_update,
+    read_legacy_compat_update,
+)
+from app.services.validation_service import (
+    ApprovalValidationError,
+    ValidationError,
+)
+from app.api.v1.update_models import HistoryEntryPatch
 
 
 router = APIRouter()
@@ -22,29 +41,31 @@
 
 
 class SelfServeResponse(BaseModel):
-    assigned: list[GroundTruthItem]
+    assigned: list[AgenticGroundTruthEntry]
     requested: int
     assignedCount: int
 
 
 class AssignmentUpdateRequest(BaseModel):
-    """Payload for SME update (save draft / approve / skip / delete).
-
-    Using a Pydantic model allows camelCase -> snake_case alias handling. All fields optional; we
-    only mutate those explicitly provided (tracked via model_fields_set).
-    """
-
     model_config = ConfigDict(populate_by_name=True, extra="allow")
 
-    edited_question: Optional[str] = Field(default=None, alias="editedQuestion")
-    answer: Optional[str] = None
     comment: Optional[str] = None
-    status: Optional[GroundTruthStatus | str] = None
-    refs: Optional[list[Reference]] = None
+    status: GroundTruthStatus | str | SkipJsonSchema[None] = None
     manual_tags: Optional[list[str]] = Field(default=None, alias="manualTags")
     approve: Optional[bool] = None
     etag: Optional[str] = Field(default=None, alias="etag")
-    history: Optional[list[dict[str, Any]]] = None
+    history: Optional[list[HistoryEntryPatch]] = None
+    context_entries: Optional[list[ContextEntry]] = Field(default=None, alias="contextEntries")
+    tool_calls: Optional[list[ToolCallRecord]] = Field(default=None, alias="toolCalls")
+    expected_tools: ExpectedTools | SkipJsonSchema[None] = Field(
+        default=None, alias="expectedTools"
+    )
+    feedback: Optional[list[FeedbackEntry]] = None
+    metadata: Optional[dict[str, Any]] = None
+    plugins: Optional[dict[str, PluginPayload]] = None
+    trace_ids: Optional[dict[str, str]] = Field(default=None, alias="traceIds")
+    trace_payload: Optional[dict[str, Any]] = Field(default=None, alias="tracePayload")
+    scenario_id: Optional[str] = Field(default=None, alias="scenarioId")
 
 
 @router.post("/self-serve")
@@ -74,12 +95,12 @@ async def self_serve_assignments(
 @router.get("/my")
 async def list_my_assignments(
     user: UserContext = Depends(get_current_user),
-) -> list[GroundTruthItem]:
+) -> list[AgenticGroundTruthEntry]:
     # Fetch assignment documents (materialized view), then fetch underlying GroundTruth items.
     assignments: list[AssignmentDocument] = await container.repo.list_assignments_by_user(
         user.user_id
     )
-    results: list[GroundTruthItem] = []
+    results: list[AgenticGroundTruthEntry] = []
     for ad in assignments:
         gt = await container.repo.get_gt(ad.datasetName, ad.bucket, ad.ground_truth_id)
         if not gt:
@@ -101,7 +122,7 @@ async def update_item(
     payload: AssignmentUpdateRequest,
     user: UserContext = Depends(get_current_user),
     if_match: str | None = Header(default=None, alias="If-Match"),
-) -> GroundTruthItem:
+) -> AgenticGroundTruthEntry:
     # Fold soft delete into PUT via status=deleted
     it = await container.repo.get_gt(dataset, bucket, item_id)
     if not it:
@@ -116,110 +137,54 @@ async def update_item(
     original_assigned_to = it.assignedTo
 
     provided_fields: Set[str] = set(payload.model_fields_set)
-
-    # Apply updates conditionally
-    if "edited_question" in provided_fields:
-        it.edited_question = payload.edited_question  # type: ignore[assignment]
-    if "answer" in provided_fields:
-        it.answer = payload.answer  # type: ignore[assignment]
-    if "comment" in provided_fields:
-        it.comment = payload.comment  # type: ignore[assignment]
-
-    now = datetime.now(timezone.utc)
-
-    # Track whether we need to delete the assignment document
-    should_delete_assignment = False
-
-    # Approve convenience flag
-    if bool(payload.approve):
-        it.status = GroundTruthStatus.approved
-        it.reviewed_at = now
-        it.updatedBy = user.user_id
-        # Clear assignment fields on completion
-        it.assignedTo = None
-        it.assigned_at = None
-        should_delete_assignment = True
-
-    # Status update handling (skip/delete/approved explicitly)
-    if "status" in provided_fields:
-        try:
-            val = payload.status
-            if isinstance(val, GroundTruthStatus) or val is None:
-                it.status = val  # type: ignore[assignment]
-            else:
-                it.status = GroundTruthStatus(str(val))
-        except Exception:
-            it.status = cast(Any, payload.status)  # type: ignore[assignment]
-        if it.status in {GroundTruthStatus.approved, GroundTruthStatus.deleted}:
-            # Clear assignment when moving out of draft (keep for skipped so another SME can pick it up)
-            it.assignedTo = None
-            it.assigned_at = None
-            should_delete_assignment = True
-        if it.status == GroundTruthStatus.approved:
-            it.reviewed_at = now
-            it.updatedBy = user.user_id
-    if "refs" in provided_fields and payload.refs is not None:
-        it.refs = payload.refs  # already validated
-    # Tags (validated by model validators)
-    if "manual_tags" in provided_fields:
-        try:
-            it.manual_tags = payload.manual_tags or []
-        except ValueError as e:
-            raise HTTPException(status_code=400, detail=str(e))
-    # History (with refs in agent messages)
-    if "history" in provided_fields and payload.history is not None:
-        try:
-            # Convert dict representations to HistoryItem models
-            history_items = []
-            for h in payload.history:
-                # Parse refs if present in the history item
-                refs_data = h.get("refs")
-                refs_list = None
-                if refs_data is not None:
-                    refs_list = [
-                        r if isinstance(r, Reference) else Reference(**r) for r in refs_data
-                    ]
-                # Parse expected_behavior if present in the history item
-                expected_behavior_data = h.get("expected_behavior") or h.get("expectedBehavior")
-                history_items.append(
-                    HistoryItem(
-                        role=h["role"],
-                        msg=h.get("msg")
-                        or h.get("content", ""),  # Support both 'msg' and 'content'
-                        refs=refs_list,
-                        expected_behavior=expected_behavior_data
-                        if isinstance(expected_behavior_data, list)
-                        else None,
-                    )
-                )
-            it.history = history_items
-        except (KeyError, ValueError) as e:
-            raise HTTPException(status_code=400, detail=f"Invalid history format: {str(e)}")
-    # ETag handling: require an ETag for all SME updates (approve/skip/delete)
-    provided_etag = payload.etag
-    if not if_match and not provided_etag:
-        raise HTTPException(status_code=412, detail="ETag required")
-    if if_match:
-        it.etag = if_match
-    elif provided_etag:
-        it.etag = provided_etag
-
-    # Apply computed tags before saving
+    payload_extras = payload.model_extra or {}
     try:
-        apply_computed_tags(it)
-    except ValueError as e:
-        # Convert ValueError from validation to HTTP 400
-        raise HTTPException(status_code=400, detail=str(e))
-
+        mutation = apply_shared_update(
+            it,
+            provided_fields=provided_fields,
+            comment=payload.comment,
+            history_entries=payload.history,
+            context_entries=payload.context_entries,
+            tool_calls=payload.tool_calls,
+            expected_tools=payload.expected_tools,
+            feedback=payload.feedback,
+            metadata=payload.metadata,
+            plugins=payload.plugins,
+            trace_ids=payload.trace_ids,
+            trace_payload=payload.trace_payload,
+            scenario_id=payload.scenario_id,
+            manual_tags=payload.manual_tags,
+            status=payload.status,
+            approve=bool(payload.approve),
+            actor_user_id=user.user_id,
+            legacy_update=read_legacy_compat_update(payload_extras),
+            clear_assignment_on_statuses={
+                GroundTruthStatus.approved,
+                GroundTruthStatus.deleted,
+            },
+        )
+    except ValidationError as e:
+        raise HTTPException(status_code=400, detail=e.message)
     try:
-        await container.repo.upsert_gt(it)
+        await persist_shared_update(
+            container.repo,
+            it,
+            if_match=if_match,
+            payload_etag=payload.etag,
+        )
+    except ApprovalValidationError as e:
+        raise HTTPException(
+            status_code=400, detail={"code": "INVALID_APPROVAL", "errors": e.errors}
+        )
+    except ETagRequiredError:
+        raise HTTPException(status_code=412, detail="ETag required")
+    except ETagMismatchError:
+        raise HTTPException(status_code=412, detail="ETag mismatch")
     except ValueError as e:
-        if str(e) == "etag_mismatch":
-            raise HTTPException(status_code=412, detail="ETag mismatch")
-        raise
+        raise HTTPException(status_code=400, detail=str(e))
 
     # Delete assignment document after successful GT update
-    if should_delete_assignment and original_assigned_to:
+    if mutation.should_delete_assignment and original_assigned_to:
         try:
             deleted = await container.repo.delete_assignment_doc(
                 user_id=original_assigned_to,
@@ -272,7 +237,7 @@ async def assign_item(
     item_id: str,
     body: AssignItemRequest | None = None,
     user: UserContext = Depends(get_current_user),
-) -> GroundTruthItem | JSONResponse:
+) -> AgenticGroundTruthEntry | JSONResponse:
     """Assign a specific ground truth item to the current user.
 
     This endpoint:
@@ -346,7 +311,7 @@ async def duplicate_assignment_item(
     bucket: UUID,
     item_id: str,
     user: UserContext = Depends(get_current_user),
-) -> GroundTruthItem:
+) -> AgenticGroundTruthEntry:
     """Duplicate an existing item as a draft rephrase, assign to caller, and return the new item."""
     original = await container.repo.get_gt(dataset, bucket, item_id)
     if not original:
diff --git a/backend/app/api/v1/chat.py b/backend/app/api/v1/chat.py
deleted file mode 100644
index 9eff9aa..0000000
--- a/backend/app/api/v1/chat.py
+++ /dev/null
@@ -1,158 +0,0 @@
-from __future__ import annotations
-
-import logging
-import re
-import uuid
-
-from fastapi import APIRouter, Depends, HTTPException, status
-from pydantic import BaseModel, Field, ConfigDict, field_validator
-
-from app.container import container
-from app.core.auth import Principal, require_user
-from app.core.config import settings
-
-logger = logging.getLogger(__name__)
-
-router = APIRouter()
-
-_SCRIPT_PATTERN = re.compile(r"<\s*/?\s*script\b", re.IGNORECASE)
-_WHITESPACE_PATTERN = re.compile(r"\s+")
-
-# Safe error messages that don't leak internal details
-SAFE_ERROR_MESSAGES = {
-    "invalid_input": "Invalid request format",
-    "service_unavailable": "Service temporarily unavailable",
-    "processing_error": "Unable to process request",
-}
-
-
-def sanitize_message(raw: str) -> str:
-    """Normalize whitespace and reject obvious script tags."""
-    if not raw:
-        return ""
-
-    cleaned = _WHITESPACE_PATTERN.sub(" ", raw).strip()
-    if not cleaned:
-        return ""
-
-    if _SCRIPT_PATTERN.search(cleaned):
-        raise ValueError("message contains disallowed content")
-
-    return cleaned
-
-
-class ChatRequest(BaseModel):
-    message: str = Field(min_length=1)
-    context: str | None = Field(default=None)
-
-    @field_validator("message")
-    @classmethod
-    def _validate_message(_cls, value: str) -> str:
-        sanitized = sanitize_message(value)
-        if not sanitized:
-            raise ValueError("message cannot be empty")
-        return sanitized
-
-    @field_validator("context")
-    @classmethod
-    def _trim_context(_cls, value: str | None) -> str | None:
-        if value is None:
-            return None
-        trimmed = value.strip()
-        return trimmed or None
-
-
-class ChatReference(BaseModel):
-    model_config = ConfigDict(extra="allow")
-
-    id: str | None = None
-    title: str | None = None
-    url: str | None = None
-    snippet: str | None = None
-    keyParagraph: str | None = None
-
-
-class ChatResponse(BaseModel):
-    content: str
-    references: list[ChatReference] = Field(default_factory=list)
-
-
-@router.post("/chat", response_model=ChatResponse)
-async def chat(
-    body: ChatRequest,
-    principal: Principal = Depends(require_user),
-) -> ChatResponse:
-    if not settings.CHAT_ENABLED:
-        raise HTTPException(
-            status_code=status.HTTP_503_SERVICE_UNAVAILABLE,
-            detail="Chat is disabled",
-        )
-
-    user_id = principal.email or principal.oid or principal.name or "anonymous"
-
-    # Generate correlation ID for error tracking without exposing internals
-    correlation_id = str(uuid.uuid4())
-
-    logger.info(
-        "Chat request correlation_id=%s user=%s message_length=%d has_context=%s",
-        correlation_id,
-        user_id,
-        len(body.message),
-        body.context is not None,
-    )
-
-    try:
-        result = await container.chat_service.generate_response(
-            user_id=user_id,
-            message=body.message,
-            context=body.context,
-        )
-        logger.info(
-            "Chat success correlation_id=%s ref_count=%d",
-            correlation_id,
-            len(result.get("references", [])),
-        )
-        # Don't log full result at INFO level - may contain PII or sensitive data
-        logger.debug("Chat result correlation_id=%s", correlation_id)
-    except ValueError as exc:
-        # Log validation errors server-side with details
-        logger.warning("Validation error correlation_id=%s error=%s", correlation_id, str(exc))
-        raise HTTPException(
-            status_code=status.HTTP_422_UNPROCESSABLE_ENTITY,
-            detail=SAFE_ERROR_MESSAGES["invalid_input"],
-            headers={"X-Correlation-ID": correlation_id},
-        )
-    except RuntimeError as exc:
-        # Log runtime errors server-side with details
-        logger.error("Runtime error correlation_id=%s error=%s", correlation_id, str(exc))
-        raise HTTPException(
-            status_code=status.HTTP_502_BAD_GATEWAY,
-            detail=SAFE_ERROR_MESSAGES["service_unavailable"],
-            headers={"X-Correlation-ID": correlation_id},
-        )
-    except Exception:  # pragma: no cover - safeguard unexpected failures
-        # Log unexpected errors with full stack trace server-side only
-        logger.error("Unexpected error correlation_id=%s", correlation_id, exc_info=True)
-        raise HTTPException(
-            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
-            detail=SAFE_ERROR_MESSAGES["processing_error"],
-            headers={"X-Correlation-ID": correlation_id},
-        )
-
-    try:
-        response = ChatResponse(**result)
-        logger.info(
-            "Returning ChatResponse correlation_id=%s content_length=%d references_count=%d",
-            correlation_id,
-            len(response.content),
-            len(response.references),
-        )
-        return response
-    except Exception:
-        # Log response parsing errors server-side only with stack trace
-        logger.error("Response parsing error correlation_id=%s", correlation_id, exc_info=True)
-        raise HTTPException(
-            status_code=status.HTTP_502_BAD_GATEWAY,
-            detail=SAFE_ERROR_MESSAGES["processing_error"],
-            headers={"X-Correlation-ID": correlation_id},
-        )
diff --git a/backend/app/api/v1/ground_truths.py b/backend/app/api/v1/ground_truths.py
index 587495b..32f08cd 100644
--- a/backend/app/api/v1/ground_truths.py
+++ b/backend/app/api/v1/ground_truths.py
@@ -1,5 +1,6 @@
 from __future__ import annotations
 
+import re
 import time
 
 from fastapi import APIRouter, Body, Depends, HTTPException, Header, Query
@@ -10,23 +11,42 @@
 import randomname  # type: ignore
 
 from pydantic import BaseModel, Field, ConfigDict
+from pydantic.json_schema import SkipJsonSchema
 from fastapi.responses import JSONResponse
 
 from app.core.auth import get_current_user, UserContext
 from app.core.config import settings
 from app.domain.models import (
-    GroundTruthItem,
-    Reference,
+    AgenticGroundTruthEntry,
+    ContextEntry,
+    ExpectedTools,
+    FeedbackEntry,
     GroundTruthListResponse,
     HistoryItem,
+    PluginPayload,
+    ToolCallRecord,
     BulkImportError,
+    BulkImportPersistenceError,
     ValidationSummary,
 )
 from app.domain.enums import GroundTruthStatus, SortField, SortOrder
 from app.plugins import get_default_registry
 from app.container import container
 from app.exports.models import SnapshotExportRequest
-from app.services.validation_service import validate_bulk_items
+from app.api.v1.update_models import HistoryEntryPatch
+from app.services.ground_truth_update_service import (
+    ETagMismatchError,
+    ETagRequiredError,
+    apply_shared_update,
+    persist_shared_update,
+    read_legacy_compat_update,
+)
+from app.services.validation_service import (
+    ApprovalValidationError,
+    ValidationError,
+    validate_bulk_items,
+    validate_item_for_approval,
+)
 from app.services.pii_service import PIIWarning, scan_bulk_items_for_pii
 from app.services.duplicate_detection_service import (
     DuplicateWarning,
@@ -38,6 +58,15 @@
 logger = logging.getLogger("gtc.ground_truths")
 
 router = APIRouter()
+_PERSISTENCE_ERROR_ITEM_ID_RE = re.compile(r"\bid:\s*([^)]+?)\)")
+
+
+def _extract_persistence_error_item_id(error_msg: str) -> str | None:
+    match = _PERSISTENCE_ERROR_ITEM_ID_RE.search(error_msg)
+    if match is None:
+        return None
+    item_id = match.group(1).strip()
+    return item_id or None
 
 
 class ImportBulkResponse(BaseModel):
@@ -90,9 +119,41 @@ class RecomputeTagsResponse(BaseModel):
     duration_ms: int = Field(description="Operation duration in milliseconds.")
 
 
+class GroundTruthUpdateRequest(BaseModel):
+    model_config = ConfigDict(populate_by_name=True, extra="allow")
+
+    status: GroundTruthStatus | str | SkipJsonSchema[None] = None
+    comment: str | None = None
+    history: list[HistoryEntryPatch] | None = None
+    context_entries: list[ContextEntry] | None = Field(default=None, alias="contextEntries")
+    tool_calls: list[ToolCallRecord] | None = Field(default=None, alias="toolCalls")
+    expected_tools: ExpectedTools | SkipJsonSchema[None] = Field(
+        default=None, alias="expectedTools"
+    )
+    feedback: list[FeedbackEntry] | None = None
+    metadata: dict[str, Any] | None = None
+    plugins: dict[str, PluginPayload] | None = None
+    manual_tags: list[str] | None = Field(default=None, alias="manualTags")
+    trace_ids: dict[str, str] | None = Field(default=None, alias="traceIds")
+    trace_payload: dict[str, Any] | None = Field(default=None, alias="tracePayload")
+    scenario_id: str | None = Field(default=None, alias="scenarioId")
+    etag: str | None = Field(default=None, alias="etag")
+
+
+def _coerce_history_for_internal_use(item: AgenticGroundTruthEntry) -> None:
+    if not item.history:
+        return
+    item.history = [
+        entry
+        if isinstance(entry, HistoryItem)
+        else HistoryItem.model_validate(entry.model_dump(by_alias=True))
+        for entry in item.history
+    ]
+
+
 @router.post("", response_model=ImportBulkResponse)
 async def import_bulk(
-    items: list[GroundTruthItem],
+    items: list[AgenticGroundTruthEntry],
     user: UserContext = Depends(get_current_user),
     buckets: int | None = Query(default=None, ge=1, le=50),
     approve: bool = Query(
@@ -111,7 +172,8 @@ async def import_bulk(
     """
     errors: list[BulkImportError] = []
     uuids: list[str] = []
-    gt_items: list[GroundTruthItem] = []
+    gt_items: list[AgenticGroundTruthEntry] = []
+    unknown_persistence_failures: int = 0
 
     # Ensure IDs in input order, preserving provided IDs when present
     for it in items:
@@ -122,21 +184,34 @@ async def import_bulk(
         it.id = item_id
         uuids.append(item_id)
 
+    # Track failed request-entry positions for accurate per-entry failed counting.
+    # Using a set[int] of original request indices avoids undercounting when
+    # duplicate IDs appear in the same request.
+    failed_request_indices: set[int] = set()
+    # Parallel list of original request indices for items that reach gt_items;
+    # maintains a 1-to-1 correspondence with gt_items throughout the pipeline.
+    gt_item_orig_indices: list[int] = []
+
     # Validate all items before processing
+    # validate_bulk_items is keyed by request-position index (not item.id) so
+    # duplicate IDs in one request do not collapse per-entry error attribution.
     validation_errors = await validate_bulk_items(items)
 
     # If there are validation errors, filter out invalid items and collect errors
     if validation_errors:
-        for item in items:
-            if item.id in validation_errors:
-                # Collect all validation errors for this item
-                errors.extend(validation_errors[item.id])
+        for req_idx, item in enumerate(items):
+            if req_idx in validation_errors:
+                # Collect all validation errors for this request entry
+                errors.extend(validation_errors[req_idx])
+                failed_request_indices.add(req_idx)
             else:
                 # Only include valid items
                 gt_items.append(item)
+                gt_item_orig_indices.append(req_idx)
     else:
         # All items are valid
-        gt_items = items
+        gt_items = list(items)
+        gt_item_orig_indices = list(range(len(items)))
 
     # Optionally set approval metadata
     if approve:
@@ -147,28 +222,110 @@ async def import_bulk(
             it.reviewed_at = now
             it.updatedBy = updater
 
+        # Enforce generic approval validation for approved items.
+        # validate_bulk_items returns results keyed by position within gt_items here
+        # (not by item.id) so duplicate IDs in the filtered list remain distinct.
+        approval_validation_errors = await validate_bulk_items(gt_items)
+
+        # Check generic approval invariants plus plugin-pack hooks for each item
+        approval_ready_items: list[AgenticGroundTruthEntry] = []
+        approval_ready_orig_indices: list[int] = []
+        for gt_pos, item in enumerate(gt_items):
+            orig_idx = gt_item_orig_indices[gt_pos]
+            item_errors = []
+
+            # Add tag validation errors if present; fix index to original request position
+            if gt_pos in approval_validation_errors:
+                for err in approval_validation_errors[gt_pos]:
+                    err.index = orig_idx
+                item_errors.extend(approval_validation_errors[gt_pos])
+
+            # Run the shared approval path (generic core + plugin-pack hooks).
+            # validate_item_for_approval combines collect_approval_validation_errors
+            # with container.plugin_pack_registry.collect_approval_errors so that
+            # plugin-owned constraints (e.g. RagCompatPack) are enforced here too.
+            try:
+                validate_item_for_approval(item)
+            except ApprovalValidationError as exc:
+                for err_msg in exc.errors:
+                    item_errors.append(
+                        BulkImportError(
+                            index=orig_idx,
+                            item_id=item.id,
+                            field="approval",
+                            code="APPROVAL_VALIDATION_FAILED",
+                            message=err_msg,
+                        )
+                    )
+
+            if item_errors:
+                errors.extend(item_errors)
+                failed_request_indices.add(orig_idx)
+            else:
+                approval_ready_items.append(item)
+                approval_ready_orig_indices.append(orig_idx)
+
+        gt_items = approval_ready_items
+        gt_item_orig_indices = approval_ready_orig_indices
+
     # Persist only validated items
     if gt_items:
         # Apply computed tags to each item before persisting
         # Fetch registry once for performance (avoids O(n) singleton lookups)
         registry = get_default_registry()
         for it in gt_items:
+            _coerce_history_for_internal_use(it)
             apply_computed_tags(it, registry)
 
         result = await container.repo.import_bulk_gt(gt_items, buckets=buckets)
 
-        # Convert repository errors (plain strings) to structured errors
-        # Repository doesn't provide index, so we can't map back to original position
-        # These are persistence errors after validation passed
-        for error_msg in result.errors:
+        # Convert repository errors (plain strings) to structured errors.
+        # Cosmos error messages include the item id when available, so recover it
+        # to preserve original request indices and per-entry failed-item counting.
+        # Build a per-id ordered list of original indices from the items submitted
+        # to persistence so duplicate IDs get distinct request positions.
+        id_to_orig_indices: dict[str, list[int]] = {}
+        for orig_idx, it in zip(gt_item_orig_indices, gt_items):
+            id_to_orig_indices.setdefault(it.id, []).append(orig_idx)
+        id_consumed: dict[str, int] = {}  # tracks how many errors per id we've consumed
+        unknown_persistence_failures = 0
+        persistence_errors = result.persistence_errors or [
+            BulkImportPersistenceError(
+                message=error_msg,
+                item_id=_extract_persistence_error_item_id(error_msg),
+            )
+            for error_msg in result.errors
+        ]
+        for persistence_error in persistence_errors:
+            error_msg = persistence_error.message
+            item_id = persistence_error.item_id or _extract_persistence_error_item_id(error_msg)
+            error_code = "CREATE_FAILED" if "create_failed" in error_msg.lower() else "DUPLICATE_ID"
+            if (
+                persistence_error.persistence_index is not None
+                and 0 <= persistence_error.persistence_index < len(gt_item_orig_indices)
+            ):
+                error_index = gt_item_orig_indices[persistence_error.persistence_index]
+                failed_request_indices.add(error_index)
+            elif item_id and item_id in id_to_orig_indices:
+                consumed = id_consumed.get(item_id, 0)
+                indices = id_to_orig_indices[item_id]
+                # Use the consumed-th matching index; clamp to last for extra errors.
+                error_index = indices[consumed] if consumed < len(indices) else indices[-1]
+                id_consumed[item_id] = consumed + 1
+                failed_request_indices.add(error_index)
+            elif item_id:
+                # item_id matches nothing we submitted; treat it as unrecoverable (index -1)
+                error_index = -1
+                unknown_persistence_failures += 1
+            else:
+                error_index = -1
+                unknown_persistence_failures += 1
             errors.append(
                 BulkImportError(
-                    index=-1,  # Index unknown for persistence errors
-                    item_id=None,  # Parse from error message if needed
+                    index=error_index,
+                    item_id=item_id,
                     field=None,
-                    code="CREATE_FAILED"
-                    if "create_failed" in error_msg.lower()
-                    else "DUPLICATE_ID",
+                    code=error_code,
                     message=error_msg,
                 )
             )
@@ -191,7 +348,7 @@ async def import_bulk(
         try:
             # Fetch all approved items from the same dataset(s) to check against
             datasets = {item.datasetName for item in items}
-            approved_items: list[GroundTruthItem] = []
+            existing_approved_items: list[AgenticGroundTruthEntry] = []
             for dataset in datasets:
                 # Fetch approved items from this dataset
                 items_list, _ = await container.repo.list_gt_paginated(
@@ -202,9 +359,9 @@ async def import_bulk(
                     sort_by=SortField.updated_at,
                     sort_order=SortOrder.desc,
                 )
-                approved_items.extend(items_list)
+                existing_approved_items.extend(items_list)
 
-            duplicate_warnings = detect_duplicates_for_bulk_items(items, approved_items)
+            duplicate_warnings = detect_duplicates_for_bulk_items(items, existing_approved_items)
             if duplicate_warnings:
                 logger.info(
                     f"api.import_bulk.duplicates_detected - "
@@ -218,7 +375,10 @@ async def import_bulk(
 
     # Build validation summary
     total_items = len(items)
-    failed_count = len(errors)
+    # Count unique failed request entries (not raw error count — one item may produce
+    # multiple errors, and duplicate IDs in one request each count independently).
+    # unknown_persistence_failures counts errors with no recoverable item id.
+    failed_count = len(failed_request_indices) + unknown_persistence_failures
     validation_summary = ValidationSummary(
         total=total_items,
         succeeded=imported_count,
@@ -412,7 +572,7 @@ async def list_ground_truths(
     datasetName: str,
     status: GroundTruthStatus | None = None,
     user: UserContext = Depends(get_current_user),
-) -> list[GroundTruthItem]:
+) -> list[AgenticGroundTruthEntry]:
     try:
         items = await container.repo.list_gt_by_dataset(datasetName, status)
     except ValueError as e:
@@ -428,7 +588,7 @@ async def get_ground_truth(
     bucket: UUID,
     item_id: str,
     user: UserContext = Depends(get_current_user),
-) -> GroundTruthItem:
+) -> AgenticGroundTruthEntry:
     item = await container.repo.get_gt(datasetName, bucket, item_id)
     if not item:
         raise HTTPException(status_code=404, detail="Item not found")
@@ -440,112 +600,66 @@ async def update_ground_truth(
     datasetName: str,
     bucket: UUID,
     item_id: str,
-    payload: dict[str, Any],
+    payload: GroundTruthUpdateRequest,
     user: UserContext = Depends(get_current_user),
     if_match: str | None = Header(default=None, alias="If-Match"),
-) -> GroundTruthItem:
+) -> AgenticGroundTruthEntry:
     it = await container.repo.get_gt(datasetName, bucket, item_id)
     if not it:
         raise HTTPException(status_code=404, detail="Item not found")
-    # Apply updates including references
-    for k in ["edited_question", "answer", "status"]:
-        if k in payload:
-            if k == "status":
-                # Coerce string status to GroundTruthStatus enum to keep the
-                # model consistent and avoid Pydantic serializer warnings.
-                try:
-                    val = payload[k]
-                    if isinstance(val, GroundTruthStatus):
-                        status_val = val
-                    else:
-                        status_val = GroundTruthStatus(val)
-                    setattr(it, "status", status_val)
-                except Exception:
-                    # Let Pydantic / API validation handle invalid values
-                    setattr(it, "status", payload[k])
-            else:
-                setattr(it, k, payload[k])
-    if "comment" in payload:
-        it.comment = payload["comment"]
-    if "refs" in payload and isinstance(payload["refs"], list):
-        # Validate minimal Reference structure; rely on Pydantic validation when saving
-        refs_payload = cast(list[Reference | dict[str, Any]], payload["refs"])
-        it.refs = [r if isinstance(r, Reference) else Reference(**r) for r in refs_payload]
-        # Reset totalReferences to force recalculation by model_validator
-        it.totalReferences = 0
-
-    # Tags handling: Only accept 'manualTags' in payload
-    # computedTags are system-generated and cannot be set by clients
-    # Explicitly reject 'computedTags' and legacy 'tags' fields
-    if "computedTags" in payload:
+    provided_fields = set(payload.model_fields_set)
+    payload_extras = payload.model_extra or {}
+
+    # Tags handling: Only accept 'manualTags' in the generic contract
+    if "computedTags" in payload_extras:
         raise HTTPException(
             status_code=400,
             detail="computedTags cannot be set directly; they are system-generated",
         )
-    if "tags" in payload:
+    if "tags" in payload_extras:
         raise HTTPException(
             status_code=400,
             detail="'tags' field is deprecated; use 'manualTags' instead",
         )
-    if "manualTags" in payload:
-        try:
-            it.manual_tags = payload["manualTags"]
-        except ValueError as e:
-            raise HTTPException(status_code=400, detail=str(e))
-
-    # History (with refs in agent messages)
-    if "history" in payload:
-        if payload["history"] is None:
-            it.history = None
-            # Reset totalReferences to force recalculation by model_validator
-            it.totalReferences = 0
-        elif isinstance(payload["history"], list):
-            try:
-                # Convert dict representations to HistoryItem models
-                history_items = []
-                for h in payload["history"]:
-                    # Parse refs if present in the history item
-                    refs_data = h.get("refs")
-                    refs_list = None
-                    if refs_data is not None:
-                        refs_list = [
-                            r if isinstance(r, Reference) else Reference(**r) for r in refs_data
-                        ]
-                    # Parse expected_behavior if present in the history item
-                    expected_behavior_data = h.get("expected_behavior") or h.get("expectedBehavior")
-                    history_items.append(
-                        HistoryItem(
-                            role=h["role"],
-                            msg=h.get("msg")
-                            or h.get("content", ""),  # Support both 'msg' and 'content'
-                            refs=refs_list,
-                            expected_behavior=expected_behavior_data
-                            if isinstance(expected_behavior_data, list)
-                            else None,
-                        )
-                    )
-                it.history = history_items
-                # Reset totalReferences to force recalculation by model_validator
-                it.totalReferences = 0
-            except (KeyError, ValueError) as e:
-                raise HTTPException(status_code=400, detail=f"Invalid history format: {str(e)}")
-    # Concurrency: use If-Match header or etag field in body
-    provided_etag: str | None = cast(str | None, payload.get("etag"))
-    if not if_match and not provided_etag:
-        # Require ETag to perform update
-        raise HTTPException(status_code=412, detail="ETag required")
-    if if_match:
-        it.etag = if_match
-    elif provided_etag:
-        it.etag = provided_etag
     try:
-        # Apply computed tags before saving
-        apply_computed_tags(it)
-        await container.repo.upsert_gt(it)
+        apply_shared_update(
+            it,
+            provided_fields=provided_fields,
+            comment=payload.comment,
+            history_entries=payload.history,
+            context_entries=payload.context_entries,
+            tool_calls=payload.tool_calls,
+            expected_tools=payload.expected_tools,
+            feedback=payload.feedback,
+            metadata=payload.metadata,
+            plugins=payload.plugins,
+            trace_ids=payload.trace_ids,
+            trace_payload=payload.trace_payload,
+            scenario_id=payload.scenario_id,
+            manual_tags=payload.manual_tags,
+            status=payload.status,
+            actor_user_id=user.user_id,
+            legacy_update=read_legacy_compat_update(payload_extras),
+        )
+    except ValidationError as e:
+        raise HTTPException(status_code=400, detail=str(e))
+    try:
+        await persist_shared_update(
+            container.repo,
+            it,
+            if_match=if_match,
+            payload_etag=payload.etag,
+        )
+    except ApprovalValidationError as e:
+        raise HTTPException(
+            status_code=400, detail={"code": "INVALID_APPROVAL", "errors": e.errors}
+        )
+    except ETagRequiredError:
+        raise HTTPException(status_code=412, detail="ETag required")
+    except ETagMismatchError:
+        raise HTTPException(status_code=412, detail="ETag mismatch")
     except ValueError as e:
-        if str(e) == "etag_mismatch":
-            raise HTTPException(status_code=412, detail="ETag mismatch")
-        raise
+        raise HTTPException(status_code=400, detail=str(e))
     # Fetch and return the updated item so the response includes the fresh ETag
     # and any server-populated fields (mirrors assignments API behavior).
     latest = await container.repo.get_gt(datasetName, bucket, item_id)
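Review note: the removed inline block took the concurrency token from the `If-Match` header first and the body `etag` second, and answered 412 when neither was present. This change moves that decision into `persist_shared_update`, whose body is not shown in this part of the diff. Below is a minimal sketch of that precedence, assuming the helper keeps the old behaviour. `resolve_etag` is a hypothetical name, and `ETagRequiredError` is redefined locally only so the snippet runs on its own.

```python
from __future__ import annotations


class ETagRequiredError(Exception):
    """Local stand-in for the exception the route maps to HTTP 412."""


def resolve_etag(if_match: str | None, payload_etag: str | None) -> str:
    """Pick the concurrency token: If-Match header first, body etag second."""
    # Hypothetical helper: mirrors the precedence of the removed inline code,
    # not necessarily what persist_shared_update does internally.
    etag = if_match or payload_etag
    if not etag:
        raise ETagRequiredError("ETag required")
    return etag
```

If the real helper orders the two sources differently, clients that send both a header and a body `etag` would see a behaviour change, so the precedence may be worth a test.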
diff --git a/backend/app/api/v1/router.py b/backend/app/api/v1/router.py
index 6247483..a6ef100 100644
--- a/backend/app/api/v1/router.py
+++ b/backend/app/api/v1/router.py
@@ -8,7 +8,6 @@
     tags,
     datasets,
     search,
-    chat,
 )
 from app.api.v1 import config
 
@@ -22,4 +21,3 @@
 api_router.include_router(tags.router, prefix="", tags=["tags"])  # /tags endpoints
 api_router.include_router(datasets.router, prefix="", tags=["datasets"])  # /datasets endpoints
 api_router.include_router(search.router, prefix="", tags=["search"])  # /search endpoint
-api_router.include_router(chat.router, prefix="", tags=["chat"])  # /chat endpoint
diff --git a/backend/app/api/v1/stats.py b/backend/app/api/v1/stats.py
index f40a9b0..ba90cca 100644
--- a/backend/app/api/v1/stats.py
+++ b/backend/app/api/v1/stats.py
@@ -10,7 +10,8 @@
 
 @router.get("/ground-truths/stats")
 async def get_stats(user: UserContext = Depends(get_current_user)):
-    return await container.repo.stats()
+    base_stats = await container.repo.stats()
+    return container.plugin_pack_registry.collect_stats(base_stats.model_dump())
 
 
 # todo: add endpoint for all user stats
diff --git a/backend/app/api/v1/update_models.py b/backend/app/api/v1/update_models.py
new file mode 100644
index 0000000..030b0a7
--- /dev/null
+++ b/backend/app/api/v1/update_models.py
@@ -0,0 +1,14 @@
+from __future__ import annotations
+
+from pydantic import BaseModel, ConfigDict
+
+
+class HistoryEntryPatch(BaseModel):
+    model_config = ConfigDict(
+        title="HistoryEntryPatch",
+        populate_by_name=True,
+        extra="allow",
+    )
+
+    role: str
+    msg: str | None = None
diff --git a/backend/app/container.py b/backend/app/container.py
index 5c91f68..1054d48 100644
--- a/backend/app/container.py
+++ b/backend/app/container.py
@@ -11,9 +11,6 @@
 from app.services.curation_service import CurationService
 from app.adapters.repos.tags_repo import CosmosTagsRepo
 from app.services.tag_registry_service import TagRegistryService
-from app.adapters.gtc_inference_adapter import GTCInferenceAdapter
-from app.services.chat_service import ChatService
-from app.adapters.agent_steps_store import AgentStepsStore
 from app.exports.formatters.json_items import JsonItemsFormatter
 from app.exports.formatters.json_snapshot_payload import JsonSnapshotPayloadFormatter
 from app.exports.pipeline import ExportPipeline
@@ -26,11 +23,36 @@
 from app.exports.storage.base import ExportStorage
 from app.exports.storage.blob import BlobExportStorage
 from app.exports.storage.local import LocalExportStorage
+from app.plugins.pack_registry import get_default_pack_registry
+from app.plugins.base import PluginPackRegistry
+from app.adapters.repos.memory_repo import InMemoryGroundTruthRepo
+from app.adapters.search.demo_search import DemoSearchAdapter
+from app.demo_seed import DEMO_CURATION_INSTRUCTIONS, build_demo_items
 
 
 logger = logging.getLogger("gtc.container")
 
 
+class InMemoryTagsRepo:
+    def __init__(self) -> None:
+        self.tags: list[str] = []
+
+    async def get_global_tags(self) -> list[str]:
+        return list(self.tags)
+
+    async def save_global_tags(self, tags: list[str]) -> list[str]:
+        self.tags = sorted(set(tags))
+        return list(self.tags)
+
+    async def upsert_add(self, tags_to_add: list[str]) -> list[str]:
+        return await self.save_global_tags([*self.tags, *tags_to_add])
+
+    async def upsert_remove(self, tags_to_remove: list[str]) -> list[str]:
+        remove = {str(tag) for tag in tags_to_remove}
+        self.tags = [tag for tag in self.tags if tag not in remove]
+        return list(self.tags)
+
+
 class Container:
     # Class-level annotations so static checkers understand intended types
     repo: GroundTruthRepo
@@ -41,14 +63,12 @@ class Container:
     tag_registry_service: TagRegistryService
     tags_repo: CosmosTagsRepo
     tag_definitions_repo: Any  # CosmosTagDefinitionsRepo
-    inference_service: GTCInferenceAdapter | None
-    chat_service: ChatService
-    agent_steps_store: AgentStepsStore | None
     export_pipeline: ExportPipeline
     export_processor_registry: ExportProcessorRegistry
     export_formatter_registry: ExportFormatterRegistry
     export_storage: ExportStorage
     export_default_processor_order: list[str]
+    plugin_pack_registry: PluginPackRegistry
 
     def __init__(self) -> None:
         # Lazily initialize repo and services. Tests and app lifespan will call
@@ -62,18 +82,14 @@ def __init__(self) -> None:
         self.tags_repo = cast(CosmosTagsRepo, None)
         self.tag_definitions_repo = cast(Any, None)
         self.tag_registry_service = cast(TagRegistryService, None)
-        self.inference_service = None  # Lazily initialized by init_chat()
-        self.agent_steps_store = cast(AgentStepsStore | None, None)
         self.export_storage = self._build_export_storage()
         self.export_processor_registry = self._build_export_processor_registry()
         self.export_formatter_registry = self._build_export_formatter_registry()
         self.export_pipeline = ExportPipeline(self.export_storage)
         self.export_default_processor_order = parse_processor_order(settings.EXPORT_PROCESSOR_ORDER)
-        self.chat_service = ChatService(
-            inference_service=None,
-            steps_store=self.agent_steps_store,
-            store_steps=settings.STORE_AGENT_STEPS,
-        )
+        # Plugin-pack registry, lazily populated on first use. startup_cosmos()
+        # calls validate_all() so that every pack passes its startup checks.
+        self.plugin_pack_registry = get_default_pack_registry()
 
     def _build_default_credential(self) -> Any:
         """Create a DefaultAzureCredential for runtime use.
@@ -117,6 +133,16 @@ def _build_export_formatter_registry(self) -> ExportFormatterRegistry:
         )
         return registry
 
+    def _build_snapshot_service(self, repo: GroundTruthRepo) -> SnapshotService:
+        return SnapshotService(
+            repo,
+            export_pipeline=self.export_pipeline,
+            processor_registry=self.export_processor_registry,
+            formatter_registry=self.export_formatter_registry,
+            default_processor_order=self.export_default_processor_order,
+            plugin_export_transforms=self.plugin_pack_registry.collect_export_transforms(),
+        )
+
     def init_cosmos_repo(self, db_name: str | None = None) -> None:
         """Create a Cosmos repo instance and wire services.
 
@@ -170,13 +196,7 @@ def init_cosmos_repo(self, db_name: str | None = None) -> None:
         self.assignment_service = AssignmentService(self.repo)
         # Keep existing search service (may already be wired with adapter)
         self.search_service = self.search_service or SearchService()
-        self.snapshot_service = SnapshotService(
-            self.repo,
-            export_pipeline=self.export_pipeline,
-            processor_registry=self.export_processor_registry,
-            formatter_registry=self.export_formatter_registry,
-            default_processor_order=self.export_default_processor_order,
-        )
+        self.snapshot_service = self._build_snapshot_service(self.repo)
         self.curation_service = CurationService(self.repo)
         # Initialize tags repo and service (shares the same Cosmos account/db)
         self.tags_repo = CosmosTagsRepo(
@@ -201,6 +221,28 @@ def init_cosmos_repo(self, db_name: str | None = None) -> None:
             credential=credential,
         )
 
+    def init_memory_repo(self, *, enable_demo_data: bool = False) -> None:
+        demo_items = build_demo_items(settings.DEMO_USER_ID) if enable_demo_data else []
+        demo_instructions = DEMO_CURATION_INSTRUCTIONS if enable_demo_data else []
+        self.repo = InMemoryGroundTruthRepo(
+            items=demo_items,
+            curation_instructions=demo_instructions,
+        )
+        self.assignment_service = AssignmentService(self.repo)
+        self.snapshot_service = self._build_snapshot_service(self.repo)
+        self.curation_service = CurationService(self.repo)
+        self.tags_repo = cast(CosmosTagsRepo, InMemoryTagsRepo())
+        self.tag_registry_service = TagRegistryService(self.tags_repo)
+        self.tag_definitions_repo = cast(Any, None)
+        self.search_service = (
+            SearchService(DemoSearchAdapter(demo_items)) if enable_demo_data else SearchService()
+        )
+        logger.info(
+            "Using InMemoryGroundTruthRepo (demo_mode=%s, items=%s)",
+            enable_demo_data,
+            len(demo_items),
+        )
+
     async def startup_cosmos(self, db_name: str | None = None) -> None:
         """Initialize and validate Cosmos repos and services.
 
@@ -239,10 +281,27 @@ async def startup_cosmos(self, db_name: str | None = None) -> None:
         await self.tag_definitions_repo.validate_container()
         logger.info("Cosmos DB validation passed.")
 
+        # Step 4: Run plugin-pack startup validation so misconfigured packs
+        # fail here with an actionable error rather than silently at runtime.
+        logger.info("Running plugin-pack startup validation...")
+        self.plugin_pack_registry.validate_all()
+        logger.info(
+            "Plugin-pack validation passed. Registered packs: %s",
+            self.plugin_pack_registry.names(),
+        )
+
     def init_search(self) -> None:
         """Configure search adapter if Azure Search settings are present."""
         from app.adapters.search.azure_ai_search import AzureAISearchAdapter
 
+        if (
+            settings.REPO_BACKEND == "memory"
+            and settings.DEMO_MODE
+            and getattr(self.search_service, "adapter", None) is not None
+        ):
+            logger.info("Using demo search adapter for memory-backed demo mode")
+            return
+
         if settings.AZ_SEARCH_ENDPOINT and settings.AZ_SEARCH_INDEX:
             token_cred = None
             if not settings.AZ_SEARCH_KEY:
@@ -269,71 +328,5 @@ def init_search(self) -> None:
             self.search_service = self.search_service or SearchService()
             logger.info("Search adapter not configured; using no-op SearchService")
 
-    def _build_sync_default_credential(self) -> Any:
-        """Create a sync DefaultAzureCredential for runtime use.
-        Used for GTCInferenceAdapter which requires sync credentials.
-        """
-        try:
-            from azure.identity import DefaultAzureCredential
-        except Exception as e:
-            raise RuntimeError(f"azure-identity not installed: {e}")
-        # Exclude shared cache for server scenarios to keep minimal surface
-        return DefaultAzureCredential(exclude_shared_token_cache_credential=True)
-
-    def init_chat(self) -> None:
-        """Configure chat inference service and chat service.
-
-        Validates that retrieval configuration is present when agent is configured.
-        Uses managed identity for both agent auth and retrieval token minting.
-        """
-        project_endpoint = settings.AZURE_AI_PROJECT_ENDPOINT
-        agent_id = settings.AZURE_AI_AGENT_ID
-        retrieval_url = settings.RETRIEVAL_URL
-        permissions_scope = settings.RETRIEVAL_PERMISSIONS_SCOPE
-
-        # Only build the inference service when fully configured.
-        if not project_endpoint or not agent_id:
-            self.inference_service = None
-        elif not retrieval_url or not permissions_scope:
-            logger.error(
-                "Agent is configured but retrieval settings missing. "
-                "Set GTC_RETRIEVAL_URL and GTC_RETRIEVAL_PERMISSIONS_SCOPE."
-            )
-            # Mark as not configured so we fail at runtime with 502
-            self.inference_service = None
-        else:
-            # Use sync DefaultAzureCredential for GTCInferenceAdapter
-            # (reused for both agent auth and retrieval token minting)
-            credential = self._build_sync_default_credential()
-
-            self.inference_service = GTCInferenceAdapter(
-                project_endpoint=project_endpoint,
-                agent_id=agent_id,
-                retrieval_url=retrieval_url,
-                permissions_scope=permissions_scope,
-                timeout_seconds=settings.RETRIEVAL_TIMEOUT_SECONDS,
-                credential=credential,
-            )
-
-        # Reuse any existing steps store instance (may be configured elsewhere)
-        store = getattr(self, "agent_steps_store", None)
-        self.chat_service = ChatService(
-            inference_service=self.inference_service,
-            steps_store=store,
-            store_steps=settings.STORE_AGENT_STEPS and bool(store),
-        )
-
-        if not settings.CHAT_ENABLED:
-            self.chat_service.set_store_steps(False)
-            logger.info("Chat service disabled via settings")
-        elif not self.inference_service:
-            logger.info("Chat service running in mock mode (agent not configured)")
-        else:
-            logger.info(
-                "Chat service configured with Azure AI Project (endpoint=%s, agent=%s)",
-                settings.AZURE_AI_PROJECT_ENDPOINT,
-                settings.AZURE_AI_AGENT_ID,
-            )
-
 
 container = Container()
diff --git a/backend/app/core/config.py b/backend/app/core/config.py
index de36b51..36228b2 100644
--- a/backend/app/core/config.py
+++ b/backend/app/core/config.py
@@ -31,6 +31,14 @@ class Settings(BaseSettings):
 
     # Repository backend selection
     REPO_BACKEND: str = "memory"  # memory|cosmos
+    DEMO_MODE: bool = Field(
+        default=False,
+        validation_alias=AliasChoices("GTC_DEMO_MODE", "DEMO_MODE", "VITE_DEMO_MODE"),
+    )
+    DEMO_USER_ID: str = Field(
+        default="anonymous",
+        validation_alias=AliasChoices("GTC_DEMO_USER_ID", "DEMO_USER_ID", "VITE_DEV_USER_ID"),
+    )
 
     # Cosmos DB configuration (for cosmos backend)
     COSMOS_ENDPOINT: str | None = None
@@ -68,18 +76,6 @@ class Settings(BaseSettings):
     SEARCH_FIELD_TITLE: str = "title"
     SEARCH_FIELD_CHUNK: str = "chunk"
 
-    # Agent chat settings (Azure AI Foundry Agent Service)
-    CHAT_ENABLED: bool = True
-    AZURE_AI_PROJECT_ENDPOINT: str | None = None
-    AZURE_AI_AGENT_ID: str | None = None
-    AGENT_TIMEOUT_SECONDS: int = 30
-    STORE_AGENT_STEPS: bool = False
-
-    # Retrieval service settings (for FunctionTool-based agent retrieval)
-    RETRIEVAL_URL: str | None = None
-    RETRIEVAL_PERMISSIONS_SCOPE: str | None = None
-    RETRIEVAL_TIMEOUT_SECONDS: int = 30
-
     # Optional static frontend serving
     FRONTEND_DIR: str | None = None  # Absolute path inside container (e.g., /app/frontend)
     FRONTEND_INDEX: str = "index.html"
@@ -164,6 +160,7 @@ def _validate_export_storage(self) -> "Settings":
         ),
     )
     SERVICE_NAME: str = "gtc-backend"
+    HARNESS_JSONL_ENABLED: bool = False
 
     # Azure Container Apps Easy Auth (ACA) settings
     # When enabled, we expect ACA to inject identity headers (X-MS-CLIENT-PRINCIPAL, etc.).
diff --git a/backend/app/core/harness_observability.py b/backend/app/core/harness_observability.py
new file mode 100644
index 0000000..a08eee7
--- /dev/null
+++ b/backend/app/core/harness_observability.py
@@ -0,0 +1,136 @@
+from __future__ import annotations
+
+import json
+from datetime import UTC, datetime
+from pathlib import Path
+from time import perf_counter
+from uuid import uuid4
+
+from fastapi import FastAPI, Request
+
+from .config import REPO_ROOT, settings
+
+HARNESS_DIR = REPO_ROOT.parent / ".harness"
+LOG_PATH = HARNESS_DIR / "logs.jsonl"
+TRACE_PATH = HARNESS_DIR / "traces.jsonl"
+
+
+def _append_jsonl(path: Path, entry: dict[str, object | None]) -> None:
+    HARNESS_DIR.mkdir(parents=True, exist_ok=True)
+    with path.open("a", encoding="utf-8") as handle:
+        handle.write(json.dumps(entry, ensure_ascii=False) + "\n")
+
+
+def _utc_now() -> datetime:
+    return datetime.now(UTC)
+
+
+def _level_for_status(status_code: int) -> str:
+    if status_code >= 500:
+        return "ERROR"
+    if status_code >= 400:
+        return "WARN"
+    return "INFO"
+
+
+def _status_for_status(status_code: int) -> str:
+    return "error" if status_code >= 400 else "ok"
+
+
+def install_harness_jsonl_middleware(app: FastAPI) -> FastAPI:
+    @app.middleware("http")
+    async def _write_harness_events(request: Request, call_next):  # type: ignore[no-redef]
+        started_at = _utc_now()
+        started_perf = perf_counter()
+        trace_id = uuid4().hex
+        span_id = uuid4().hex[:16]
+        request_name = f"{request.method} {request.url.path}"
+
+        try:
+            response = await call_next(request)
+        except Exception as exc:
+            ended_at = _utc_now()
+            duration_ms = round((perf_counter() - started_perf) * 1000)
+            status_code = 500
+
+            _append_jsonl(
+                LOG_PATH,
+                {
+                    "ts": ended_at.isoformat(),
+                    "level": "ERROR",
+                    "msg": f"{request_name} 500",
+                    "service": settings.SERVICE_NAME,
+                    "trace_id": trace_id,
+                    "span_id": span_id,
+                    "duration_ms": duration_ms,
+                    "status": "error",
+                    "method": request.method,
+                    "path": request.url.path,
+                    "http_status": status_code,
+                    "error": exc.__class__.__name__,
+                },
+            )
+            _append_jsonl(
+                TRACE_PATH,
+                {
+                    "trace_id": trace_id,
+                    "span_id": span_id,
+                    "parent_id": None,
+                    "name": request_name,
+                    "service": settings.SERVICE_NAME,
+                    "start": started_at.isoformat(),
+                    "end": ended_at.isoformat(),
+                    "duration_ms": duration_ms,
+                    "status": "error",
+                    "method": request.method,
+                    "path": request.url.path,
+                    "http_status": status_code,
+                },
+            )
+            raise
+
+        ended_at = _utc_now()
+        duration_ms = round((perf_counter() - started_perf) * 1000)
+        level = _level_for_status(response.status_code)
+        status = _status_for_status(response.status_code)
+        error = "client error" if response.status_code >= 400 else None
+        if response.status_code >= 500:
+            error = "server error"
+
+        _append_jsonl(
+            LOG_PATH,
+            {
+                "ts": ended_at.isoformat(),
+                "level": level,
+                "msg": f"{request_name} {response.status_code}",
+                "service": settings.SERVICE_NAME,
+                "trace_id": trace_id,
+                "span_id": span_id,
+                "duration_ms": duration_ms,
+                "status": status,
+                "method": request.method,
+                "path": request.url.path,
+                "http_status": response.status_code,
+                "error": error,
+            },
+        )
+        _append_jsonl(
+            TRACE_PATH,
+            {
+                "trace_id": trace_id,
+                "span_id": span_id,
+                "parent_id": None,
+                "name": request_name,
+                "service": settings.SERVICE_NAME,
+                "start": started_at.isoformat(),
+                "end": ended_at.isoformat(),
+                "duration_ms": duration_ms,
+                "status": status,
+                "method": request.method,
+                "path": request.url.path,
+                "http_status": response.status_code,
+            },
+        )
+        return response
+
+    return app
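Review note: the middleware appends one JSON object per request to `.harness/logs.jsonl`, with `level`, `duration_ms` and `http_status` among its fields. Below is a minimal sketch of how a harness script might summarise such a file. `summarize_log_lines` is a hypothetical helper and not part of this change.

```python
from __future__ import annotations

import json
from collections import Counter
from typing import Iterable


def summarize_log_lines(lines: Iterable[str]) -> dict[str, object]:
    """Count entries per level and report the slowest request in milliseconds."""
    levels: Counter[str] = Counter()
    slowest_ms = 0
    for raw in lines:
        raw = raw.strip()
        if not raw:
            # Skip blank lines, e.g. a trailing newline at end of file.
            continue
        entry = json.loads(raw)
        levels[entry["level"]] += 1
        slowest_ms = max(slowest_ms, int(entry["duration_ms"]))
    return {"levels": dict(levels), "slowest_ms": slowest_ms}
```

Passing an open file handle works because a text file iterates line by line, so the whole log never has to be held in memory.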
diff --git a/backend/app/demo_seed.py b/backend/app/demo_seed.py
new file mode 100644
index 0000000..258509a
--- /dev/null
+++ b/backend/app/demo_seed.py
@@ -0,0 +1,637 @@
+from __future__ import annotations
+
+import json
+from datetime import datetime, timezone
+from typing import TypedDict
+from uuid import UUID
+
+from app.plugins.adapters.trace_export import TraceExportAdapter
+from app.domain.enums import GroundTruthStatus
+from app.domain.models import (
+    AgenticGroundTruthEntry,
+    DatasetCurationInstructions,
+    ExpectedTools,
+    HistoryEntry,
+    HistoryItem,
+    Reference,
+    ToolExpectation,
+)
+
+CUSTOMER_FEEDBACK_BUCKET = UUID("11111111-1111-1111-1111-111111111111")
+NETWORK_DIAGNOSTICS_BUCKET = UUID("22222222-2222-2222-2222-222222222222")
+
+
+class DemoTraceConfig(TypedDict, total=False):
+    id: str
+    dataset: str
+    bucket: UUID
+    status: GroundTruthStatus
+    assigned: bool
+    scenario_id: str
+    manual_tags: list[str]
+    comment: str
+    refs: list[Reference]
+    required_tools: list[str]
+    reviewed_at: datetime
+    updated_by: str
+
+
+def _ts(year: int, month: int, day: int, hour: int = 12, minute: int = 0) -> datetime:
+    return datetime(year, month, day, hour, minute, tzinfo=timezone.utc)
+
+
+def _result(payload: dict[str, object]) -> str:
+    return json.dumps(payload)
+
+
+def _reference(url: str, title: str, content: str, *, excerpt: str | None = None) -> Reference:
+    return Reference(url=url, title=title, content=content, keyExcerpt=excerpt)
+
+
+def _trace(
+    *,
+    trace_id: str,
+    conversation_id: str,
+    feedback_at: datetime,
+    user_query: str,
+    chat_response: str,
+    rca: str,
+    tool_calls: list[dict[str, object]],
+    resolution: str,
+    additional_feedback: dict[str, int],
+    impacted_device: str,
+    feedback_type: str = "like",
+    comment: str = "",
+) -> dict[str, object]:
+    return {
+        "id": trace_id,
+        "cid_list": [conversation_id],
+        "uid": "[REDACTED_UID]",
+        "impacted_device_type": "MSISDN",
+        "impacted_device": impacted_device,
+        "metric_name": "user feedback",
+        "type": feedback_type,
+        "comment": comment,
+        "additional_feedback": additional_feedback,
+        "resolution": resolution,
+        "feedback_date": int(feedback_at.timestamp()),
+        "feedback_datetime_utc": feedback_at.isoformat(),
+        "chat_history": [
+            {
+                "user_query": user_query,
+                "chat_response": chat_response,
+                "rca": rca,
+                "context": tool_calls,
+            }
+        ],
+    }
+
+
+def _tool_call(
+    *,
+    tool_id: str,
+    run_id: str,
+    name: str,
+    arguments: str,
+    result: dict[str, object],
+    execution_time: float,
+) -> dict[str, object]:
+    return {
+        "id": tool_id,
+        "run_id": run_id,
+        "function_name": name,
+        "function_arguments": arguments,
+        "function_result": _result(result),
+        "execution_time": execution_time,
+    }
+
+
+DEMO_TRACE_EXPORTS: list[tuple[dict[str, object], DemoTraceConfig]] = [
+    (
+        _trace(
+            trace_id="a4d42fa7-a99c-47a4-8f04-11e80067b001",
+            conversation_id="demo-cid-001",
+            feedback_at=_ts(2026, 2, 18, 8, 57),
+            user_query="CX IS USING TOO MUCH DATA AND WANTS TO KNOW WHY",
+            chat_response=(
+                "The analysis shows a sustained increase in mobile data usage over the last week, with the "
+                "largest spikes occurring during evening video sessions. The account stayed on cellular data "
+                "for several long sessions and did not fall back to Wi-Fi."
+            ),
+            rca=(
+                "### Root Cause\nThe customer exceeded the 50 GB allowance on the Premium Data Plan after "
+                "streaming and hotspot usage stayed on cellular data. No provisioning or outage issue was found."
+            ),
+            resolution="CX WAS NOT ON WIFI AND OVERLY USING DATA",
+            impacted_device="[REDACTED_MSISDN_001]",
+            additional_feedback={
+                "The recommended resolution was correct and appropriate": 2,
+                "The explanation and investigation areas were relevant to the issue": 2,
+                "The explanation of how the RCA was reached was clear": 2,
+                "The RCA included all key information (nothing important was missed)": 2,
+            },
+            tool_calls=[
+                _tool_call(
+                    tool_id="tool-001",
+                    run_id="run-001",
+                    name="get_location",
+                    arguments="msisdn='[REDACTED_MSISDN_001]' context=None",
+                    result={
+                        "response": {
+                            "items": [{"valueObject": {"location": {"wifiConnected": False}}}]
+                        }
+                    },
+                    execution_time=2.41,
+                ),
+                _tool_call(
+                    tool_id="tool-002",
+                    run_id="run-001",
+                    name="get_plan_usage",
+                    arguments="msisdn='[REDACTED_MSISDN_001]' context=None",
+                    result={
+                        "response": {"items": [{"valueObject": {"planLimitGb": 50, "usageGb": 63}}]}
+                    },
+                    execution_time=1.83,
+                ),
+                _tool_call(
+                    tool_id="tool-003",
+                    run_id="run-001",
+                    name="Billing_agent",
+                    arguments="msisdn='[REDACTED_MSISDN_001]' context=None",
+                    result={"response": {"summary": "Overage charges align with plan cap breach."}},
+                    execution_time=1.17,
+                ),
+            ],
+        ),
+        {
+            "id": "demo-data-overage",
+            "dataset": "customer-feedback",
+            "bucket": CUSTOMER_FEEDBACK_BUCKET,
+            "status": GroundTruthStatus.draft,
+            "assigned": True,
+            "scenario_id": "feedback-data-overage",
+            "manual_tags": ["issue:data-usage", "resolution:wifi-education"],
+            "comment": "High-signal example that mirrors the redacted RCA flow.",
+            "refs": [
+                _reference(
+                    "https://telco.example.com/help/data-usage/check-usage",
+                    "Check mobile data usage",
+                    "Use the account usage view to compare current-cycle consumption against the plan cap.",
+                    excerpt="Current cycle usage can exceed the allowance before the invoice closes.",
+                ),
+                _reference(
+                    "https://telco.example.com/help/data-usage/wifi-assist",
+                    "Reduce cellular usage with Wi-Fi",
+                    "Encourage customers to enable Wi-Fi for streaming and hotspot-heavy sessions when available.",
+                    excerpt="Streaming over cellular is a common source of overage charges.",
+                ),
+            ],
+            "required_tools": ["get_location", "get_plan_usage", "Billing_agent"],
+        },
+    ),
+    (
+        _trace(
+            trace_id="b7d42fa7-a99c-47a4-8f04-11e80067b002",
+            conversation_id="demo-cid-002",
+            feedback_at=_ts(2026, 2, 21, 10, 12),
+            user_query="CUSTOMER SAYS HOTSPOT USAGE SPIKED OVER THE WEEKEND",
+            chat_response=(
+                "Weekend usage was concentrated on tethering sessions from a single handset. The usage pattern "
+                "matches a laptop hotspot workflow rather than background network activity."
+            ),
+            rca=(
+                "### Root Cause\nThe line used 18 GB of hotspot traffic across two tethered devices during a "
+                "road trip. The usage is expected and does not indicate fraud or a network defect."
+            ),
+            resolution="HOTSPOT TRAFFIC DROVE THE WEEKEND DATA SPIKE",
+            impacted_device="[REDACTED_MSISDN_002]",
+            additional_feedback={
+                "The recommended resolution was correct and appropriate": 2,
+                "The explanation and investigation areas were relevant to the issue": 2,
+                "The explanation of how the RCA was reached was clear": 1,
+                "The RCA included all key information (nothing important was missed)": 2,
+            },
+            tool_calls=[
+                _tool_call(
+                    tool_id="tool-101",
+                    run_id="run-002",
+                    name="get_device_details",
+                    arguments="msisdn='[REDACTED_MSISDN_002]' context=None",
+                    result={
+                        "response": {
+                            "items": [
+                                {"valueObject": {"deviceModel": "Phone X", "hotspotCapable": True}}
+                            ]
+                        }
+                    },
+                    execution_time=1.24,
+                ),
+                _tool_call(
+                    tool_id="tool-102",
+                    run_id="run-002",
+                    name="get_hotspot_usage",
+                    arguments="msisdn='[REDACTED_MSISDN_002]' context=None",
+                    result={
+                        "response": {
+                            "items": [{"valueObject": {"hotspotUsageGb": 18, "window": "48h"}}]
+                        }
+                    },
+                    execution_time=1.76,
+                ),
+                _tool_call(
+                    tool_id="tool-103",
+                    run_id="run-002",
+                    name="Data_agent",
+                    arguments="msisdn='[REDACTED_MSISDN_002]' context=None",
+                    result={"response": {"summary": "No anomaly detected beyond tethering usage."}},
+                    execution_time=1.09,
+                ),
+            ],
+        ),
+        {
+            "id": "demo-hotspot-weekend",
+            "dataset": "customer-feedback",
+            "bucket": CUSTOMER_FEEDBACK_BUCKET,
+            "status": GroundTruthStatus.draft,
+            "assigned": True,
+            "scenario_id": "feedback-hotspot-weekend",
+            "manual_tags": ["issue:hotspot-usage", "channel:billing"],
+            "comment": "Assigned draft focused on hotspot RCA and customer coaching.",
+            "refs": [
+                _reference(
+                    "https://telco.example.com/help/hotspot/usage-breakdown",
+                    "Understand hotspot usage",
+                    "Hotspot sessions are billed against the same plan cap and can create sharp short-term spikes.",
+                ),
+                _reference(
+                    "https://telco.example.com/help/hotspot/manage-devices",
+                    "Manage tethered devices",
+                    "Review device-level tethering behavior before escalating a data spike as suspicious.",
+                ),
+            ],
+            "required_tools": ["get_hotspot_usage", "Data_agent"],
+        },
+    ),
+    (
+        _trace(
+            trace_id="c8d42fa7-a99c-47a4-8f04-11e80067b003",
+            conversation_id="demo-cid-003",
+            feedback_at=_ts(2026, 2, 16, 14, 30),
+            user_query="CUSTOMER WAS CHARGED ROAMING FEES EVEN THOUGH THEY BOUGHT A PASS",
+            chat_response=(
+                "The roaming pass was purchased after the first day of travel, so the initial sessions were "
+                "billed at standard roaming rates. Later usage correctly applied the travel pass."
+            ),
+            rca=(
+                "### Root Cause\nCharges occurred before the pass activation timestamp. The network and "
+                "provisioning states were healthy, and billing aligned with the order timeline."
+            ),
+            resolution="ROAMING PASS ACTIVATED AFTER THE FIRST CHARGED SESSION",
+            impacted_device="[REDACTED_MSISDN_003]",
+            additional_feedback={
+                "The recommended resolution was correct and appropriate": 2,
+                "The explanation and investigation areas were relevant to the issue": 2,
+                "The explanation of how the RCA was reached was clear": 2,
+                "The RCA included all key information (nothing important was missed)": 2,
+            },
+            tool_calls=[
+                _tool_call(
+                    tool_id="tool-201",
+                    run_id="run-003",
+                    name="get_roaming_usage",
+                    arguments="msisdn='[REDACTED_MSISDN_003]' context=None",
+                    result={
+                        "response": {
+                            "items": [
+                                {"valueObject": {"chargedSessions": 3, "passCoveredSessions": 9}}
+                            ]
+                        }
+                    },
+                    execution_time=1.58,
+                ),
+                _tool_call(
+                    tool_id="tool-202",
+                    run_id="run-003",
+                    name="search_fixit_flows",
+                    arguments="query='roaming pass activation timeline'",
+                    result={
+                        "response": {"items": [{"title": "Roaming pass timing", "score": 0.94}]}
+                    },
+                    execution_time=0.89,
+                ),
+                _tool_call(
+                    tool_id="tool-203",
+                    run_id="run-003",
+                    name="Billing_agent",
+                    arguments="msisdn='[REDACTED_MSISDN_003]' context=None",
+                    result={
+                        "response": {"summary": "Billing timeline and pass activation are aligned."}
+                    },
+                    execution_time=1.11,
+                ),
+            ],
+        ),
+        {
+            "id": "demo-roaming-pass-timing",
+            "dataset": "network-diagnostics",
+            "bucket": NETWORK_DIAGNOSTICS_BUCKET,
+            "status": GroundTruthStatus.approved,
+            "assigned": False,
+            "scenario_id": "diagnostics-roaming-pass",
+            "manual_tags": ["issue:roaming", "status:approved-demo"],
+            "comment": "Approved example that demonstrates a clean billing RCA.",
+            "refs": [
+                _reference(
+                    "https://telco.example.com/help/roaming/travel-pass-timing",
+                    "Travel pass activation timing",
+                    "Travel passes only apply after activation and do not retroactively credit earlier sessions.",
+                )
+            ],
+            "required_tools": ["get_roaming_usage", "Billing_agent"],
+            "reviewed_at": _ts(2026, 2, 17, 9, 5),
+            "updated_by": "reviewer@example.com",
+        },
+    ),
+    (
+        _trace(
+            trace_id="d9d42fa7-a99c-47a4-8f04-11e80067b004",
+            conversation_id="demo-cid-004",
+            feedback_at=_ts(2026, 2, 24, 11, 45),
+            user_query="CUSTOMER LOST 5G AFTER A SIM SWAP AND WANTS TO KNOW IF PROVISIONING IS STUCK",
+            chat_response=(
+                "Provisioning records show the new SIM attached successfully, but the line has not yet completed "
+                "the final 5G feature refresh. The issue is recoverable through a backend refresh."
+            ),
+            rca=(
+                "### Root Cause\nThe SIM swap completed, but the 5G entitlement refresh did not run after the "
+                "change event. A provisioning retry is appropriate before deeper escalation."
+            ),
+            resolution="5G ENTITLEMENT REFRESH DID NOT COMPLETE AFTER SIM SWAP",
+            impacted_device="[REDACTED_MSISDN_004]",
+            additional_feedback={
+                "The recommended resolution was correct and appropriate": 1,
+                "The explanation and investigation areas were relevant to the issue": 2,
+                "The explanation of how the RCA was reached was clear": 1,
+                "The RCA included all key information (nothing important was missed)": 1,
+            },
+            tool_calls=[
+                _tool_call(
+                    tool_id="tool-301",
+                    run_id="run-004",
+                    name="get_subscription_status",
+                    arguments="msisdn='[REDACTED_MSISDN_004]' context=None",
+                    result={
+                        "response": {
+                            "items": [
+                                {"valueObject": {"simState": "active", "featureSet": ["LTE"]}}
+                            ]
+                        }
+                    },
+                    execution_time=1.42,
+                ),
+                _tool_call(
+                    tool_id="tool-302",
+                    run_id="run-004",
+                    name="qtm_provisioning_status_daily_query",
+                    arguments="msisdn='[REDACTED_MSISDN_004]' days=3",
+                    result={
+                        "response": {
+                            "items": [{"valueObject": {"lastRefresh": "failed", "attempts": 2}}]
+                        }
+                    },
+                    execution_time=2.03,
+                ),
+                _tool_call(
+                    tool_id="tool-303",
+                    run_id="run-004",
+                    name="Data_agent",
+                    arguments="msisdn='[REDACTED_MSISDN_004]' context=None",
+                    result={
+                        "response": {
+                            "summary": "Retry provisioning before opening a network ticket."
+                        }
+                    },
+                    execution_time=1.31,
+                ),
+            ],
+        ),
+        {
+            "id": "demo-sim-swap-refresh",
+            "dataset": "network-diagnostics",
+            "bucket": NETWORK_DIAGNOSTICS_BUCKET,
+            "status": GroundTruthStatus.skipped,
+            "assigned": False,
+            "scenario_id": "diagnostics-sim-swap-refresh",
+            "manual_tags": ["issue:provisioning", "status:needs-follow-up"],
+            "comment": "Skipped item keeps reclaim and retry flows visible in demo mode.",
+            "refs": [
+                _reference(
+                    "https://telco.example.com/help/provisioning/sim-swap-refresh",
+                    "Refresh service after SIM swap",
+                    "If feature entitlements lag a SIM swap, run a targeted refresh before escalating to network engineering.",
+                )
+            ],
+            "required_tools": ["get_subscription_status", "qtm_provisioning_status_daily_query"],
+        },
+    ),
+    (
+        _trace(
+            trace_id="ead42fa7-a99c-47a4-8f04-11e80067b005",
+            conversation_id="demo-cid-005",
+            feedback_at=_ts(2026, 2, 14, 17, 20),
+            user_query="CUSTOMER THINKS THERE WAS AN OUTAGE WHEN DATA SLOWED DOWN AT A STADIUM",
+            chat_response=(
+                "The network view shows a temporary congestion event around the venue with recovery later in the "
+                "evening. The account, device, and provisioning checks were otherwise healthy."
+            ),
+            rca=(
+                "### Root Cause\nTemporary cell congestion reduced throughput during a high-density event. "
+                "This was not caused by account misconfiguration, and no persistent defect remained afterward."
+            ),
+            resolution="SHORT-LIVED CELL CONGESTION DURING A HIGH-DENSITY EVENT",
+            impacted_device="[REDACTED_MSISDN_005]",
+            additional_feedback={
+                "The recommended resolution was correct and appropriate": 2,
+                "The explanation and investigation areas were relevant to the issue": 1,
+                "The explanation of how the RCA was reached was clear": 1,
+                "The RCA included all key information (nothing important was missed)": 1,
+            },
+            tool_calls=[
+                _tool_call(
+                    tool_id="tool-401",
+                    run_id="run-005",
+                    name="get_location",
+                    arguments="msisdn='[REDACTED_MSISDN_005]' context=None",
+                    result={"response": {"items": [{"valueObject": {"cellSector": "STADIUM-12"}}]}},
+                    execution_time=1.51,
+                ),
+                _tool_call(
+                    tool_id="tool-402",
+                    run_id="run-005",
+                    name="qtm_cellsector_ref_query",
+                    arguments="sector='STADIUM-12' hours=12",
+                    result={
+                        "response": {
+                            "items": [{"valueObject": {"congestionEvent": True, "peakUsers": 1840}}]
+                        }
+                    },
+                    execution_time=1.92,
+                ),
+                _tool_call(
+                    tool_id="tool-403",
+                    run_id="run-005",
+                    name="qtm_device_connectivity_kpis_7d_rolling_query",
+                    arguments="msisdn='[REDACTED_MSISDN_005]' days=7",
+                    result={
+                        "response": {
+                            "items": [{"valueObject": {"drops": 0, "attachSuccessRate": 0.99}}]
+                        }
+                    },
+                    execution_time=2.27,
+                ),
+            ],
+        ),
+        {
+            "id": "demo-stadium-congestion",
+            "dataset": "network-diagnostics",
+            "bucket": NETWORK_DIAGNOSTICS_BUCKET,
+            "status": GroundTruthStatus.deleted,
+            "assigned": False,
+            "scenario_id": "diagnostics-stadium-congestion",
+            "manual_tags": ["issue:congestion", "status:archived-demo"],
+            "comment": "Deleted sample preserves restore/delete flows with trace-heavy evidence.",
+            "refs": [
+                _reference(
+                    "https://telco.example.com/help/network/event-congestion",
+                    "Understand temporary event congestion",
+                    "Large venues can saturate nearby sectors briefly without indicating a persistent service problem.",
+                )
+            ],
+            "required_tools": ["get_location", "qtm_cellsector_ref_query"],
+        },
+    ),
+]
+
+
+def _hydrate_history_with_refs(item: AgenticGroundTruthEntry, refs: list[Reference]) -> None:
+    if not item.history:
+        return
+
+    enriched_history: list[HistoryEntry] = []
+    last_turn_index = len(item.history) - 1
+    for index, turn in enumerate(item.history):
+        enriched_history.append(
+            HistoryItem(
+                role=turn.role,
+                msg=turn.msg,
+                refs=refs if index == last_turn_index and turn.role != "user" else None,
+            )
+        )
+    item.history = enriched_history
+
+
+def _expected_tools(tool_names: list[str]) -> ExpectedTools:
+    return ExpectedTools(required=[ToolExpectation(name=name) for name in tool_names])
+
+
+def _build_demo_item(
+    trace: dict[str, object],
+    *,
+    item_id: str,
+    dataset: str,
+    bucket: UUID,
+    status: GroundTruthStatus,
+    demo_user_id: str,
+    assigned: bool,
+    scenario_id: str,
+    manual_tags: list[str],
+    comment: str,
+    refs: list[Reference],
+    required_tools: list[str],
+    reviewed_at: datetime | None = None,
+    updated_by: str | None = None,
+) -> AgenticGroundTruthEntry:
+    adapter = TraceExportAdapter(
+        dataset_name=dataset,
+        bucket=bucket,
+        status=status,
+        created_by="demo-seed",
+    )
+    adapted = adapter.adapt_payload({"trace_count": 1, "traces": [trace]})[0]
+    item = AgenticGroundTruthEntry.model_validate(adapted.model_dump(by_alias=True))
+
+    item.id = item_id
+    item.scenario_id = scenario_id
+    item.comment = comment
+    item.manual_tags = sorted(set(item.manual_tags + manual_tags))
+    item.metadata = {**item.metadata, "source": "demo-seed"}
+    item.trace_ids = {**(item.trace_ids or {}), "demoItemId": item_id}
+    item.refs = refs
+    _hydrate_history_with_refs(item, refs)
+    item.expected_tools = _expected_tools(required_tools)
+
+    if assigned:
+        item.assignedTo = demo_user_id
+        item.assigned_at = item.created_at
+
+    if reviewed_at is not None:
+        item.reviewed_at = reviewed_at
+        item.updated_at = reviewed_at
+    if updated_by is not None:
+        item.updatedBy = updated_by
+
+    return item
+
+
+def build_demo_items(demo_user_id: str) -> list[AgenticGroundTruthEntry]:
+    items: list[AgenticGroundTruthEntry] = []
+    for trace, config in DEMO_TRACE_EXPORTS:
+        items.append(
+            _build_demo_item(
+                trace,
+                item_id=config["id"],
+                dataset=config["dataset"],
+                bucket=config["bucket"],
+                status=config["status"],
+                demo_user_id=demo_user_id,
+                assigned=config["assigned"],
+                scenario_id=config["scenario_id"],
+                manual_tags=config["manual_tags"],
+                comment=config["comment"],
+                refs=config["refs"],
+                required_tools=config["required_tools"],
+                reviewed_at=config.get("reviewed_at"),
+                updated_by=config.get("updated_by"),
+            )
+        )
+    return items
+
+
+DEMO_CURATION_INSTRUCTIONS: list[DatasetCurationInstructions] = [
+    DatasetCurationInstructions(
+        id="curation-customer-feedback",
+        datasetName="customer-feedback",
+        bucket=UUID("00000000-0000-0000-0000-000000000000"),
+        instructions=(
+            "### Customer Feedback Demo Instructions\n\n"
+            "- Preserve the customer symptom exactly as reported before editing the RCA.\n"
+            "- Prefer explanations that tie plan limits, Wi-Fi usage, tethering, or billing timing back to the observed evidence.\n"
+            "- If the trace shows no defect, state that clearly and avoid inventing remediation work."
+        ),
+        updatedAt=_ts(2026, 2, 20, 8, 0),
+        updatedBy="demo-seed",
+    ),
+    DatasetCurationInstructions(
+        id="curation-network-diagnostics",
+        datasetName="network-diagnostics",
+        bucket=UUID("00000000-0000-0000-0000-000000000000"),
+        instructions=(
+            "### Network Diagnostics Demo Instructions\n\n"
+            "- Anchor the answer in the observed tool evidence and distinguish congestion, provisioning, and billing causes.\n"
+            "- Call out when a retry or refresh is appropriate before escalation.\n"
+            "- Keep RCA sections explicit so curators can quickly verify the reasoning path."
+        ),
+        updatedAt=_ts(2026, 2, 22, 8, 0),
+        updatedBy="demo-seed",
+    ),
+]
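Review note on `_hydrate_history_with_refs` above: the helper attaches the reference list only to the final history turn, and only when that turn is not a user turn; every other turn ends up with `refs=None`. The sketch below restates that rule with standard-library dataclasses so it can be checked in isolation. `Turn` and `hydrate_last_turn` are hypothetical names used for illustration only; they are not part of this change, and the real code builds `HistoryItem` models instead.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Turn:
    """Hypothetical stand-in for a history turn (not the project's HistoryItem)."""

    role: str
    msg: str
    refs: Optional[list[str]] = None


def hydrate_last_turn(history: list[Turn], refs: list[str]) -> list[Turn]:
    """Return a copy of history where only a final non-user turn carries refs."""
    if not history:
        return history
    last_index = len(history) - 1
    return [
        # All other turns are reset to refs=None, mirroring the helper in the diff.
        replace(turn, refs=refs if index == last_index and turn.role != "user" else None)
        for index, turn in enumerate(history)
    ]
```

One consequence worth confirming with the author: a conversation whose last turn is a user message receives no references at all, and any references already present on earlier turns are cleared.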
diff --git a/backend/app/domain/enums.py b/backend/app/domain/enums.py
index 355f4b8..4b8c9d0 100644
--- a/backend/app/domain/enums.py
+++ b/backend/app/domain/enums.py
@@ -28,19 +28,22 @@ class DocType(str, Enum):
 
 
 class HistoryItemRole(str, Enum):
+    """Legacy RAG compatibility enum.
+
+    The generic host model accepts arbitrary role strings; this enum remains only for
+    compatibility helpers, older tests, and the future RAG pack.
+    """
+
     user = "user"
     assistant = "assistant"
 
 
 class ExpectedBehavior(str, Enum):
-    """Expected behavior tags for history items in ground truth evaluation.
-
-    These tags describe what the agent should do at each turn of a conversation:
-    - tool:search: Agent should perform a search/retrieval operation
-    - generation:answer: Agent should generate a direct answer
-    - generation:need-context: Agent should ask for more context
-    - generation:clarification: Agent should ask for clarification
-    - generation:out-of-domain: Agent should indicate the query is out of domain
+    """Legacy RAG compatibility enum for history annotations.
+
+    The generic host model uses `expectedTools` plus plugin-owned data instead of fixed
+    top-level expected-behavior assumptions. This enum remains only for compatibility
+    helpers, older tests, and the future RAG pack.
     """
 
     tool_search = "tool:search"
diff --git a/backend/app/domain/models.py b/backend/app/domain/models.py
index bd67494..edab26f 100644
--- a/backend/app/domain/models.py
+++ b/backend/app/domain/models.py
@@ -1,174 +1,518 @@
 from __future__ import annotations
 
 from datetime import datetime, timezone
-from typing import Optional, cast
+from typing import Any, ClassVar, Optional, Literal, cast
 from uuid import UUID
 
-from pydantic import BaseModel, Field, ConfigDict, field_validator, model_validator, computed_field
+from pydantic import BaseModel, Field, ConfigDict, computed_field, field_validator, model_validator
 
-from app.domain.enums import GroundTruthStatus, HistoryItemRole, ExpectedBehavior
+from app.domain.enums import GroundTruthStatus
 from app.domain.validators import GroundTruthItemTagValidators
 
+LEGACY_HOST_FIELD_DELETE_GATES = (
+    "stored-data audit completed",
+    "caller audit completed",
+    "import/export verification completed",
+)
 
-class Reference(BaseModel):
-    """Wire reference object.
 
-    { url, title, content, keyExcerpt, type, bonus, messageIndex }
-    """
+class Reference(BaseModel):
+    """Legacy RAG reference object retained for compatibility helpers and tests."""
 
     url: str = Field(description="Reference URL (required, non-empty)")
     title: str | None = Field(default=None, description="Human-readable title for the reference")
     content: str | None = None
     keyExcerpt: str | None = None
     type: str | None = None
-    # Marks a reference as a "bonus"/optional citation for downstream consumers
     bonus: bool = False
-    # Which agent turn these refs belong to (optional)
     messageIndex: Optional[int] = None
 
     @field_validator("url")
     @classmethod
-    def validate_url_not_empty(_cls, v: str) -> str:
-        if not v or not v.strip():
+    def validate_url_not_empty(cls, value: str) -> str:
+        if not value or not value.strip():
             raise ValueError("Reference URL cannot be empty")
-        return v.strip()
+        return value.strip()
 
 
-class HistoryItem(BaseModel):
-    """Represents a single item in the multi-turn history."""
-
-    model_config = ConfigDict(populate_by_name=True)
+class HistoryEntry(BaseModel):
+    model_config = ConfigDict(populate_by_name=True, extra="forbid")
 
-    role: HistoryItemRole  # User or Assistant
+    role: str
     msg: str
-    refs: Optional[list[Reference]] = None  # References for agent messages
-    expected_behavior: Optional[list[ExpectedBehavior]] = Field(
-        default=None,
-        alias="expectedBehavior",
-        description="Expected behavior(s) for this turn in the conversation (e.g., tool:search, generation:answer)",
-    )
+
+    @field_validator("role", "msg")
+    @classmethod
+    def validate_non_empty_text(cls, value: str) -> str:
+        cleaned = value.strip()
+        if not cleaned:
+            raise ValueError("history fields cannot be empty")
+        return cleaned
+
+
+class HistoryItem(HistoryEntry):
+    """Legacy RAG-compatible history item retained for internal compatibility."""
+
+    refs: Optional[list[Reference]] = None
+    expected_behavior: Optional[list[str]] = Field(default=None, alias="expectedBehavior")
+
+    model_config = ConfigDict(populate_by_name=True, extra="forbid")
+
+
+class ContextEntry(BaseModel):
+    model_config = ConfigDict(populate_by_name=True, extra="forbid")
+
+    key: str
+    value: Any
+
+    @field_validator("key")
+    @classmethod
+    def validate_key(cls, value: str) -> str:
+        cleaned = value.strip()
+        if not cleaned:
+            raise ValueError("contextEntries[].key cannot be empty")
+        return cleaned
+
+
+class FeedbackEntry(BaseModel):
+    model_config = ConfigDict(populate_by_name=True, extra="forbid")
+
+    source: str = ""
+    values: dict[str, Any] = Field(default_factory=dict)
+
+
+class ToolCallRecord(BaseModel):
+    model_config = ConfigDict(populate_by_name=True, extra="forbid")
+
+    id: str = ""
+    name: str
+    call_type: Literal["tool", "subagent"] = Field("tool", alias="callType")
+    arguments: dict[str, Any] | None = None
+    agent: str | None = None
+    step_number: int | None = Field(None, alias="stepNumber")
+    parallel_group: str | None = Field(None, alias="parallelGroup")
+    parent_call_id: str | None = Field(None, alias="parentCallId")
+    response: Any = None
+
+    @field_validator("name")
+    @classmethod
+    def validate_name(cls, value: str) -> str:
+        cleaned = value.strip()
+        if not cleaned:
+            raise ValueError("toolCalls[].name cannot be empty")
+        return cleaned
+
+    @field_validator("step_number")
+    @classmethod
+    def validate_step_number(cls, value: int | None) -> int | None:
+        if value is not None and value < 0:
+            raise ValueError("toolCalls[].stepNumber cannot be negative")
+        return value
+
+
+class PluginPayload(BaseModel):
+    model_config = ConfigDict(populate_by_name=True, extra="forbid")
+
+    kind: str
+    version: str = "1.0"
+    data: dict[str, Any] = Field(default_factory=dict)
+
+    @field_validator("kind")
+    @classmethod
+    def validate_kind(cls, value: str) -> str:
+        cleaned = value.strip()
+        if not cleaned:
+            raise ValueError("plugins[].kind cannot be empty")
+        return cleaned
 
 
-class GroundTruthItem(GroundTruthItemTagValidators, BaseModel):
-    """Canonical Ground Truth item aligned to wire schema (schemaVersion v1).
+class ToolExpectation(BaseModel):
+    model_config = ConfigDict(populate_by_name=True, extra="forbid")
 
-    All fields with camelCase wire names use aliases; we accept both field names and aliases
-    on input (populate_by_name=True) and always serialize using by_alias.
+    name: str
+    arguments: dict[str, Any] | str | None = None
+
+    @field_validator("name")
+    @classmethod
+    def validate_name(cls, value: str) -> str:
+        cleaned = value.strip()
+        if not cleaned:
+            raise ValueError("expectedTools entries must include a non-empty name")
+        return cleaned
+
+
+class ExpectedTools(BaseModel):
+    """Tool expectations. Any tool not listed in a category is implicitly allowed."""
+
+    model_config = ConfigDict(populate_by_name=True, extra="forbid")
+
+    required: list[ToolExpectation] = Field(default_factory=list)
+    optional: list[ToolExpectation] = Field(default_factory=list)
+    not_needed: list[ToolExpectation] = Field(default_factory=list, alias="notNeeded")
+
+    @field_validator("required", "optional", "not_needed", mode="before")
+    @classmethod
+    def coerce_string_entries(cls, value: object) -> object:
+        if not isinstance(value, list):
+            return value
+        normalized: list[object] = []
+        for item in value:
+            normalized.append({"name": item} if isinstance(item, str) else item)
+        return normalized
+
+    @model_validator(mode="after")
+    def reject_overlap(self) -> ExpectedTools:
+        required_names = {tool.name for tool in self.required}
+        optional_names = {tool.name for tool in self.optional}
+        not_needed_names = {tool.name for tool in self.not_needed}
+        overlap = sorted(
+            (required_names & optional_names)
+            | (required_names & not_needed_names)
+            | (optional_names & not_needed_names)
+        )
+        if overlap:
+            raise ValueError(f"tools cannot appear in more than one category: {', '.join(overlap)}")
+        return self
+
+
+class RetrievalCandidate(BaseModel):
+    """A single retrieval result that can be associated with a specific tool call.
+
+    Supports per-tool-call ownership instead of flat top-level references,
+    and preserves the raw search payload alongside normalised fields.
     """
 
-    # Pydantic v2 config
-    model_config = ConfigDict(populate_by_name=True)
+    model_config = ConfigDict(populate_by_name=True, extra="forbid")
+
+    url: str
+    title: str | None = None
+    chunk: str | None = None
+    raw_payload: dict[str, Any] | None = Field(None, alias="rawPayload")
+    relevance: str | None = None
+    tool_call_id: str | None = Field(None, alias="toolCallId")
+
+    @field_validator("url")
+    @classmethod
+    def validate_url_not_empty(cls, value: str) -> str:
+        cleaned = value.strip()
+        if not cleaned:
+            raise ValueError("RetrievalCandidate.url cannot be empty")
+        return cleaned
+
+    @field_validator("relevance")
+    @classmethod
+    def validate_relevance(cls, value: str | None) -> str | None:
+        if value is None:
+            return None
+        allowed = {"relevant", "partially_relevant", "not_relevant"}
+        if value not in allowed:
+            raise ValueError(f"relevance must be one of {sorted(allowed)}, got '{value}'")
+        return value
+
+
+class AgenticGroundTruthEntry(GroundTruthItemTagValidators, BaseModel):
+    """Generic agentic-first host model.
+
+    The core contract intentionally exposes only the generic schema in OpenAPI. Legacy
+    RAG-shaped payloads are translated into this shape when validating this base class so
+    existing data can be carried forward without remaining top-level contract fields.
+    """
+
+    model_config = ConfigDict(populate_by_name=True, extra="forbid")
 
-    # Required core
-    # Accept either 'id' or 'uuid' on input; ID is required and generated by the API when missing
     id: str = Field(alias="id")
     datasetName: str = Field(alias="datasetName")
-    # UUID bucket assigned at import; optional on inbound until repo assigns
     bucket: Optional[UUID] = None
     status: GroundTruthStatus = GroundTruthStatus.draft
     docType: str = Field(default="ground-truth-item", alias="docType")
     schemaVersion: str = Field(default="v2", alias="schemaVersion")
 
-    # SME/curation
-    synth_question: str = Field(alias="synthQuestion")
-    edited_question: Optional[str] = Field(default=None, alias="editedQuestion")
-    answer: Optional[str] = None
-    refs: list[Reference] = cast("list[Reference]", Field(default_factory=list, alias="refs"))
-
-    # Tag fields: manualTags are user-provided, computedTags are system-generated
     manual_tags: list[str] = Field(default_factory=list, alias="manualTags")
     computed_tags: list[str] = Field(default_factory=list, alias="computedTags")
+    comment: str = ""
+
+    assignedTo: Optional[str] = Field(default=None, alias="assignedTo")
+    assigned_at: Optional[datetime] = Field(default=None, alias="assignedAt")
+    updated_at: datetime = Field(
+        default_factory=lambda: datetime.now(timezone.utc), alias="updatedAt"
+    )
+    updatedBy: Optional[str] = None
+    reviewed_at: Optional[datetime] = Field(default=None, alias="reviewedAt")
+    etag: Optional[str] = Field(default=None, alias="_etag")
+
+    scenario_id: str = Field(default="", alias="scenarioId")
+    history: list[HistoryEntry] = Field(default_factory=list)
+    context_entries: list[ContextEntry] = Field(default_factory=list, alias="contextEntries")
+
+    trace_ids: dict[str, str] | None = Field(default=None, alias="traceIds")
+    tool_calls: list[ToolCallRecord] = Field(default_factory=list, alias="toolCalls")
+    expected_tools: ExpectedTools = Field(default_factory=ExpectedTools, alias="expectedTools")
+
+    feedback: list[FeedbackEntry] = Field(default_factory=list)
+    metadata: dict[str, Any] = Field(default_factory=dict)
+    plugins: dict[str, PluginPayload] = Field(default_factory=dict)
+
+    created_by: str | None = Field(default=None, alias="createdBy")
+    created_at: datetime | None = Field(default=None, alias="createdAt")
+    trace_payload: dict[str, Any] = Field(default_factory=dict, alias="tracePayload")
+
+    _RAG_COMPAT_PLUGIN: ClassVar[str] = "rag-compat"
+
+    # --- Legacy compatibility layer ---
+    # The model_validator, computed_fields, and property accessors below exist because
+    # stored Cosmos DB documents may still carry top-level RAG fields (synthQuestion,
+    # editedQuestion, answer, refs, etc.). They transparently relocate those fields into
+    # plugins["rag-compat"] on read and re-expose them for internal code that still
+    # accesses .synth_question, .answer, .refs, .totalReferences.
+    #
+    # Hard-delete only after all LEGACY_HOST_FIELD_DELETE_GATES are satisfied. Until then,
+    # these accessors are migration projections, not long-term host ownership.
+
+    @model_validator(mode="before")
+    @classmethod
+    def translate_legacy_payload_for_core_model(cls, value: object) -> object:
+        if cls is not AgenticGroundTruthEntry:
+            return value
+        from app.plugins.packs.rag_compat import normalize_legacy_payload_for_core_model
+
+        return normalize_legacy_payload_for_core_model(value, plugin_name=cls._RAG_COMPAT_PLUGIN)
+
+    @model_validator(mode="after")
+    def restore_history_annotations(self) -> "AgenticGroundTruthEntry":
+        history_annotations = self._rag_compat_data().get("historyAnnotations")
+        if not isinstance(history_annotations, list) or not self.history:
+            return self
+
+        merged_history: list[HistoryEntry] = []
+        changed = False
+        for index, entry in enumerate(self.history):
+            annotation = history_annotations[index] if index < len(history_annotations) else None
+            if not isinstance(annotation, dict) or not annotation:
+                merged_history.append(entry)
+                continue
+
+            entry_payload = entry.model_dump(by_alias=True)
+            if "refs" in annotation:
+                entry_payload["refs"] = annotation["refs"]
+                changed = True
+            if "expectedBehavior" in annotation:
+                entry_payload["expectedBehavior"] = annotation["expectedBehavior"]
+                changed = True
+            merged_history.append(HistoryItem.model_validate(entry_payload))
+
+        if changed:
+            self.history = merged_history
+        return self
 
     @computed_field
     @property
     def tags(self) -> list[str]:
-        """Return a merged, sorted view of manual and computed tags."""
         merged = set(self.manual_tags or []) | set(self.computed_tags or [])
         return sorted(merged)
 
-    # Free-form curator notes
-    comment: Optional[str] = Field(default=None, alias="comment")
+    @computed_field(alias="synthQuestion")
+    @property
+    def compat_synth_question(self) -> str | None:
+        return self.synth_question
 
-    # Multi-turn
-    history: Optional[list[HistoryItem]] = None
+    @computed_field(alias="editedQuestion")
+    @property
+    def compat_edited_question(self) -> str | None:
+        return self.edited_question
 
-    # Generation/provenance
-    contextUsedForGeneration: Optional[str] = None
-    contextSource: Optional[str] = None
-    modelUsedForGeneration: Optional[str] = None
+    @computed_field(alias="answer")
+    @property
+    def compat_answer(self) -> str | None:
+        return self.answer
 
-    # Sampling fields
-    semanticClusterNumber: Optional[int] = None
-    weight: Optional[float] = None
-    samplingBucket: Optional[int] = None
-    questionLength: Optional[int] = None
+    @computed_field(alias="refs")
+    @property
+    def compat_refs(self) -> list[Reference]:
+        return self.refs
 
-    # Assignment & audit
-    assignedTo: Optional[str] = Field(default=None, alias="assignedTo")
-    assigned_at: Optional[datetime] = Field(default=None, alias="assignedAt")
-    updated_at: datetime = Field(
-        default_factory=lambda: datetime.now(timezone.utc), alias="updatedAt"
-    )
-    updatedBy: Optional[str] = None
-    reviewed_at: Optional[datetime] = Field(default=None, alias="reviewedAt")
-    etag: Optional[str] = Field(default=None, alias="_etag")
-    totalReferences: int = Field(default=0, alias="totalReferences")
+    @computed_field(alias="totalReferences")
+    @property
+    def compat_total_references(self) -> int:
+        return self.totalReferences
+
+    def set_plugin(self, slot: str, data: dict[str, Any], *, version: str = "1.0") -> None:
+        self.plugins[slot] = PluginPayload(kind=slot, version=version, data=data)
+
+    def get_plugin_data(self, slot: str) -> dict[str, Any] | None:
+        plugin = self.plugins.get(slot)
+        return None if plugin is None else plugin.data
+
+    def export_json_schema(self) -> dict[str, Any]:
+        return self.model_json_schema()
+
+    def _rag_compat_data(self) -> dict[str, Any]:
+        plugin = self.plugins.get(self._RAG_COMPAT_PLUGIN)
+        if plugin is None:
+            return {}
+        return plugin.data
+
+    def _set_rag_compat_value(self, key: str, value: Any) -> None:
+        plugin = self.plugins.get(self._RAG_COMPAT_PLUGIN)
+        if plugin is None:
+            plugin = PluginPayload(kind=self._RAG_COMPAT_PLUGIN, version="1.0", data={})
+            self.plugins[self._RAG_COMPAT_PLUGIN] = plugin
+        if value is None:
+            plugin.data.pop(key, None)
+        else:
+            plugin.data[key] = value
+
+    def _find_history_message(self, role: str, *, reverse: bool = False) -> str | None:
+        history = self.history or []
+        history_iterable = reversed(history) if reverse else history
+        for turn in history_iterable:
+            if turn.role == role and turn.msg:
+                return turn.msg
+        return None
+
+    def _find_last_agent_message(self) -> str | None:
+        """Return the last non-user history message (any agent role)."""
+        for turn in reversed(self.history or []):
+            if turn.role != "user" and turn.msg:
+                return turn.msg
+        return None
 
-    @model_validator(mode="after")
-    def compute_total_references_if_needed(self) -> "GroundTruthItem":
-        """Auto-compute totalReferences if it's at default (0).
-
-        This ensures totalReferences is correctly populated when items are created
-        in memory (e.g., in tests) without going through the repository layer.
-        For items loaded from database, this preserves the cached value.
-        """
-        if self.totalReferences == 0:
-            # Count references from history first (multi-turn conversations)
-            history_refs = sum(len(turn.refs or []) for turn in (self.history or []))
-            if history_refs == 0:
-                # Fall back to item-level refs (single-turn)
-                self.totalReferences = len(self.refs or [])
+    @property
+    def synth_question(self) -> str | None:
+        if "synth_question" in self.__dict__:
+            return cast(str | None, self.__dict__.get("synth_question"))
+        compat = self._rag_compat_data()
+        return cast(str | None, compat.get("synthQuestion")) or self._find_history_message("user")
+
+    @synth_question.setter
+    def synth_question(self, value: str | None) -> None:
+        if "synth_question" in getattr(type(self), "model_fields", {}):
+            self.__dict__["synth_question"] = value
+            return
+        self._set_rag_compat_value("synthQuestion", value)
+
+    @property
+    def edited_question(self) -> str | None:
+        if "edited_question" in self.__dict__:
+            return cast(str | None, self.__dict__.get("edited_question"))
+        compat = self._rag_compat_data()
+        return cast(str | None, compat.get("editedQuestion")) or self.synth_question
+
+    @edited_question.setter
+    def edited_question(self, value: str | None) -> None:
+        if "edited_question" in getattr(type(self), "model_fields", {}):
+            self.__dict__["edited_question"] = value
+            return
+        self._set_rag_compat_value("editedQuestion", value)
+
+    @property
+    def answer(self) -> str | None:
+        if "answer" in self.__dict__:
+            return cast(str | None, self.__dict__.get("answer"))
+        compat = self._rag_compat_data()
+        return cast(str | None, compat.get("answer")) or self._find_last_agent_message()
+
+    @answer.setter
+    def answer(self, value: str | None) -> None:
+        if "answer" in getattr(type(self), "model_fields", {}):
+            self.__dict__["answer"] = value
+            return
+        self._set_rag_compat_value("answer", value)
+
+    @property
+    def refs(self) -> list[Reference]:
+        direct_value = self.__dict__.get("refs")
+        if isinstance(direct_value, list):
+            return [
+                ref if isinstance(ref, Reference) else Reference.model_validate(ref)
+                for ref in direct_value
+            ]
+        from app.plugins.packs.rag_compat import compat_refs_from_payload
+
+        return cast(
+            list[Reference],
+            compat_refs_from_payload(
+                {
+                    "plugins": self.plugins,
+                    "toolCalls": self.tool_calls,
+                    "history": self.history,
+                },
+                plugin_name=self._RAG_COMPAT_PLUGIN,
+            ),
+        )
+
+    @refs.setter
+    def refs(self, value: list[Reference] | list[dict[str, Any]] | None) -> None:
+        if "refs" in getattr(type(self), "model_fields", {}):
+            self.__dict__["refs"] = list(value or [])
+            return
+        # Handle both Reference objects and dict representations
+        serialized = []
+        for ref in value or []:
+            if isinstance(ref, Reference):
+                serialized.append(ref.model_dump(by_alias=True))
+            elif isinstance(ref, dict):
+                # Validate and convert dict to ensure it's a valid reference
+                validated_ref = Reference.model_validate(ref)
+                serialized.append(validated_ref.model_dump(by_alias=True))
             else:
-                self.totalReferences = history_refs
-        return self
+                serialized.append(ref)
+        self._set_rag_compat_value("refs", serialized)
 
+    @property
+    def totalReferences(self) -> int:
+        direct_value = self.__dict__.get("totalReferences")
+        if isinstance(direct_value, int):
+            return direct_value
+        from app.plugins.packs.rag_compat import compat_total_references_from_payload
+
+        return compat_total_references_from_payload(
+            {
+                "plugins": self.plugins,
+                "toolCalls": self.tool_calls,
+                "history": self.history,
+            },
+            plugin_name=self._RAG_COMPAT_PLUGIN,
+        )
+
+    @totalReferences.setter
+    def totalReferences(self, value: int | None) -> None:
+        if "totalReferences" in getattr(type(self), "model_fields", {}):
+            self.__dict__["totalReferences"] = 0 if value is None else int(value)
+            return
+        self._set_rag_compat_value("totalReferences", None if value is None else int(value))
+
+    # NOTE: Informational RAG-era accessors (contextUsedForGeneration, contextSource,
+    # modelUsedForGeneration, semanticClusterNumber, weight, samplingBucket, questionLength)
+    # were removed in Phase 7 legacy retirement; no callers accessed them.
+    # AgenticGroundTruthEntry uses computed properties for the remaining legacy fields.
+    # Read paths extract these values from history and plugin data. Write paths
+    # normalize incoming payloads into canonical multi-turn structures.
 
-class PaginationMetadata(BaseModel):
-    """Pagination metadata for list responses."""
 
+class PaginationMetadata(BaseModel):
     model_config = ConfigDict(populate_by_name=True)
 
     page: int = Field(description="Current page number (1-indexed)")
     limit: int = Field(description="Items per page")
     total: int = Field(description="Total number of items matching filters")
-    total_pages: int = Field(
-        alias="totalPages",
-        description="Total number of pages",
-    )
-    has_next: bool = Field(
-        alias="hasNext",
-        description="Whether there is a next page",
-    )
-    has_prev: bool = Field(
-        alias="hasPrev",
-        description="Whether there is a previous page",
-    )
+    total_pages: int = Field(alias="totalPages", description="Total number of pages")
+    has_next: bool = Field(alias="hasNext", description="Whether there is a next page")
+    has_prev: bool = Field(alias="hasPrev", description="Whether there is a previous page")
 
 
 class GroundTruthListResponse(BaseModel):
     model_config = ConfigDict(populate_by_name=True)
 
-    items: list[GroundTruthItem]
+    items: list[AgenticGroundTruthEntry]
     pagination: PaginationMetadata
 
 
 class AssignmentDocument(BaseModel):
-    id: str  # stable id: "<dataset>|<bucket>|<groundTruthId>"
-    pk: str  # SME id
-    ground_truth_id: str  # ground truth id
-    datasetName: str  # dataset name and bucket comprise the GT PK
+    id: str
+    pk: str
+    ground_truth_id: str
+    datasetName: str
     bucket: UUID
-    # Document metadata
     docType: str = Field(default="sme-assignment", alias="docType")
     schemaVersion: str = Field(default="v1", alias="schemaVersion")
 
@@ -180,8 +524,6 @@ class Stats(BaseModel):
 
 
 class BulkImportError(BaseModel):
-    """Structured error for bulk import failures."""
-
     model_config = ConfigDict(populate_by_name=True)
 
     index: int = Field(description="0-based position in request array")
@@ -191,16 +533,13 @@ class BulkImportError(BaseModel):
         description="ID of the failed item (if available)",
     )
     field: str | None = Field(
-        default=None,
-        description="Field that caused the error (if applicable)",
+        default=None, description="Field that caused the error (if applicable)"
     )
     code: str = Field(description="Error code: INVALID_TAG, DUPLICATE_ID, CREATE_FAILED, etc.")
     message: str = Field(description="Human-readable error description")
 
 
 class ValidationSummary(BaseModel):
-    """Summary statistics for bulk import."""
-
     model_config = ConfigDict(populate_by_name=True)
 
     total: int = Field(description="Total items in request")
@@ -208,31 +547,33 @@ class ValidationSummary(BaseModel):
     failed: int = Field(description="Items that failed")
 
 
-class BulkImportResult(BaseModel):
-    """Result for bulk import operations.
+class BulkImportPersistenceError(BaseModel):
+    model_config = ConfigDict(populate_by_name=True)
+
+    message: str = Field(description="Human-readable persistence error description")
+    item_id: str | None = Field(
+        default=None,
+        alias="itemId",
+        description="ID of the failed item when the repository can identify it",
+    )
+    persistence_index: int | None = Field(
+        default=None,
+        alias="persistenceIndex",
+        description="0-based position in the repository persistence batch",
+    )
 
-    - imported: number of successfully imported items
-    - errors: list of error markers/messages for failed items (order corresponds to failures only)
-    """
 
+class BulkImportResult(BaseModel):
     imported: int = 0
     errors: list[str] = Field(default_factory=list)
+    persistence_errors: list[BulkImportPersistenceError] = Field(default_factory=list)
 
 
 class DatasetCurationInstructions(BaseModel):
-    """Dataset-level curation instructions document (schemaVersion v1).
-
-    Stored in the same Cosmos container as ground-truth items using MultiHash PK
-    [/datasetName, /bucket] with bucket fixed to 0 and a stable id pattern
-    "curation-instructions|{datasetName}".
-    """
-
-    # Pydantic v2 config
     model_config = ConfigDict(populate_by_name=True)
 
     id: str
     datasetName: str = Field(alias="datasetName")
-    # Use NIL UUID for dataset-level docs bucket
     bucket: UUID = Field(default_factory=lambda: UUID("00000000-0000-0000-0000-000000000000"))
     docType: str = Field(default="curation-instructions", alias="docType")
     schemaVersion: str = Field(default="v1", alias="schemaVersion")
@@ -247,17 +588,12 @@ class DatasetCurationInstructions(BaseModel):
 
 
 class TagDefinition(BaseModel):
-    """Custom tag definition created by SMEs.
-
-    Stored in Cosmos DB tag_definitions container with partition key /tag_key.
-    """
-
     model_config = ConfigDict(populate_by_name=True)
 
-    id: str  # Same as tag_key for simplicity
-    tag_key: str  # Partition key (e.g., "source:custom_value")
+    id: str
+    tag_key: str
     description: str
-    created_by: str  # User ID/email who created the definition
+    created_by: str
     created_at: datetime = Field(
         default_factory=lambda: datetime.now(timezone.utc), alias="createdAt"
     )
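
Review note on the legacy compatibility layer in `AgenticGroundTruthEntry` above: the accessors read a value from `plugins["rag-compat"]` first and only then fall back to the conversation history. The sketch below reproduces that fallback order with plain dataclasses instead of Pydantic. `Turn` and `EntrySketch` are hypothetical stand-ins, the plugin payload is simplified to a bare dict, and the `model_fields` / `__dict__` branch from the diff is left out, so treat this as an illustration of the lookup chain rather than the actual model.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any

RAG_COMPAT = "rag-compat"  # same slot name as _RAG_COMPAT_PLUGIN in the diff


@dataclass
class Turn:
    """Hypothetical, minimal history entry."""
    role: str
    msg: str


@dataclass
class EntrySketch:
    """Hypothetical stand-in for the host model (no Pydantic, no validation)."""
    history: list[Turn] = field(default_factory=list)
    plugins: dict[str, dict[str, Any]] = field(default_factory=dict)

    def _compat(self) -> dict[str, Any]:
        # Missing plugin slot behaves like an empty payload.
        return self.plugins.get(RAG_COMPAT, {})

    @property
    def synth_question(self) -> str | None:
        # 1) plugin value, 2) first non-empty user turn, 3) None.
        value = self._compat().get("synthQuestion")
        if value:
            return value
        return next((t.msg for t in self.history if t.role == "user" and t.msg), None)

    @synth_question.setter
    def synth_question(self, value: str | None) -> None:
        data = self.plugins.setdefault(RAG_COMPAT, {})
        if value is None:
            data.pop("synthQuestion", None)  # None clears the stored override
        else:
            data["synthQuestion"] = value

    @property
    def answer(self) -> str | None:
        # 1) plugin value, 2) last non-empty non-user turn, 3) None.
        value = self._compat().get("answer")
        if value:
            return value
        return next(
            (t.msg for t in reversed(self.history) if t.role != "user" and t.msg),
            None,
        )


entry = EntrySketch(
    history=[Turn("user", "q1"), Turn("agent", "a1"), Turn("user", "q2"), Turn("agent", "a2")]
)
```

Because the fallback uses truthiness, an empty string stored in the plugin payload is treated the same as a missing value, which mirrors the `or` chaining in the diff.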
diff --git a/backend/app/exports/registry.py b/backend/app/exports/registry.py
index 7a0b616..2b3959d 100644
--- a/backend/app/exports/registry.py
+++ b/backend/app/exports/registry.py
@@ -51,6 +51,20 @@ def resolve_chain(
         names = requested if requested is not None else (default_order or [])
         return [self.get(name) for name in names]
 
+    def apply_transforms(
+        self,
+        docs: list[dict[str, Any]],
+        transforms: list[Any] | None,
+    ) -> list[dict[str, Any]]:
+        if not transforms:
+            return docs
+
+        current_docs = [dict(doc) for doc in docs]
+        for transform in transforms:
+            transform_fn = getattr(transform, "transform", transform)
+            current_docs = [transform_fn(dict(doc)) for doc in current_docs]
+        return current_docs
+
 
 class ExportFormatterRegistry:
     def __init__(self) -> None:
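
Review note on `apply_transforms` above: each transform may be either an object exposing a `.transform(doc)` method or a bare callable, and every document is shallow-copied before it is handed to a transform. The standalone sketch below copies that logic into a free function to show both call styles. `AddSource` and `drop_etag` are hypothetical transforms invented for this example.

```python
from __future__ import annotations

from typing import Any


def apply_transforms(
    docs: list[dict[str, Any]],
    transforms: list[Any] | None,
) -> list[dict[str, Any]]:
    # No transforms: the input list is returned as-is (not copied).
    if not transforms:
        return docs

    current_docs = [dict(doc) for doc in docs]
    for transform in transforms:
        # Accept objects with a .transform method as well as plain callables.
        transform_fn = getattr(transform, "transform", transform)
        current_docs = [transform_fn(dict(doc)) for doc in current_docs]
    return current_docs


class AddSource:
    """Hypothetical object-style transform."""

    def transform(self, doc: dict[str, Any]) -> dict[str, Any]:
        doc["source"] = "export"
        return doc


def drop_etag(doc: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical function-style transform."""
    doc.pop("_etag", None)
    return doc


docs = [{"id": "a", "_etag": "x"}]
out = apply_transforms(docs, [AddSource(), drop_etag])
```

The copies are shallow, so a transform that mutates a nested list or dict would still affect the caller's data; only top-level keys are isolated.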
diff --git a/backend/app/main.py b/backend/app/main.py
index bdfec6e..d2fcac5 100644
--- a/backend/app/main.py
+++ b/backend/app/main.py
@@ -13,6 +13,7 @@
 from app.core.config import log_settings
 from app.core.telemetry import init_telemetry
 from app.core.auth import install_ezauth_middleware, require_user
+from app.core.harness_observability import install_harness_jsonl_middleware
 from app.core.logging import setup_logging, user_logging_middleware, attach_trace_log_filter
 from app.api.v1.router import api_router
 from app.container import container
@@ -59,14 +60,21 @@ async def lifespan(app: FastAPI):
 
     # Lazily initialize repo (creates Cosmos DB/container if configured)
     try:
-        # In test mode, fixtures configure the repo/tags repo and we must not
-        # re-initialize here (it can rebind clients to a different event loop
-        # and overwrite per-test DB names). Only initialize in non-test mode.
-        if not config.settings.COSMOS_TEST_MODE:
-            # Wire repository and services based on configured backend
-            if config.settings.REPO_BACKEND.lower() == "cosmos":
+        repo_backend = config.settings.REPO_BACKEND.lower()
+
+        # In test mode, Cosmos fixtures configure the repo/tags repo and we must
+        # not re-initialize them here. Memory-backed demo mode is different: some
+        # tests intentionally clear the container and rely on lifespan startup to
+        # rebuild the in-memory services.
+        if repo_backend == "cosmos":
+            if not config.settings.COSMOS_TEST_MODE:
                 await container.startup_cosmos()
-            # Seed built-in tags into global tag registry (idempotent add)
+        elif repo_backend == "memory" and getattr(container, "repo", None) is None:
+            container.init_memory_repo(enable_demo_data=config.settings.DEMO_MODE)
+
+        # Seed built-in tags into global tag registry (idempotent add). Keep the
+        # existing test-mode guard so Cosmos integration fixtures stay isolated.
+        if not config.settings.COSMOS_TEST_MODE:
             try:
                 defaults = sorted(
                     f"{group}:{value}"
@@ -156,6 +164,10 @@ async def _healthz() -> dict[str, str]:
         # Never block app creation due to auth middleware
         pass
 
+    # Emit local harness JSONL telemetry during harness-enabled runs.
+    if config.settings.HARNESS_JSONL_ENABLED:
+        install_harness_jsonl_middleware(app)
+
     # Inject user identity into logging for every request (after auth middleware)
     try:
         user_logging_middleware(app)
@@ -170,11 +182,6 @@ async def _healthz() -> dict[str, str]:
     except Exception:
         # Don't block app creation if search isn't configured
         pass
-    try:
-        container.init_chat()
-    except Exception:
-        # Chat wiring is optional and should not block startup
-        pass
 
     # Convenience aliases at the root for Swagger UI
     # Root convenience redirects for docs should also enforce auth (same as /v1/docs)
diff --git a/backend/app/plugins/__init__.py b/backend/app/plugins/__init__.py
index 63d5c58..232616f 100644
--- a/backend/app/plugins/__init__.py
+++ b/backend/app/plugins/__init__.py
@@ -3,22 +3,55 @@
 This package contains plugin systems for the Ground Truth Curator.
 
 Subpackages:
-    - computed_tags: Plugin implementations (auto-discovered)
+    - computed_tags: Computed-tag plugin implementations (auto-discovered).
+    - adapters: Trace adapter plugin implementations (auto-discovered).
+    - packs: Plugin-pack implementations (startup-validated, approval-contributing).
 """
 
-from app.plugins.base import ComputedTagPlugin, TagPluginRegistry
+from app.plugins.base import (
+    ComputedTagPlugin,
+    PluginPack,
+    PluginPackRegistry,
+    TagPluginRegistry,
+    TraceAdapterPlugin,
+    TraceAdapterRegistry,
+)
 from app.plugins.registry import (
     create_default_registry,
     get_default_registry,
     reset_default_registry,
 )
+from app.plugins.pack_registry import (
+    create_default_pack_registry,
+    get_default_pack_registry,
+    reset_default_pack_registry,
+)
+from app.plugins.adapter_registry import (
+    create_default_adapter_registry,
+    get_default_adapter_registry,
+    reset_default_adapter_registry,
+)
 
 __all__ = [
-    # Base classes
+    # Base classes — computed-tag plugin
     "ComputedTagPlugin",
     "TagPluginRegistry",
-    # Registry functions
+    # Base classes — plugin pack
+    "PluginPack",
+    "PluginPackRegistry",
+    # Base classes — trace adapter plugin
+    "TraceAdapterPlugin",
+    "TraceAdapterRegistry",
+    # Tag plugin registry functions
     "create_default_registry",
     "get_default_registry",
     "reset_default_registry",
+    # Plugin-pack registry functions
+    "create_default_pack_registry",
+    "get_default_pack_registry",
+    "reset_default_pack_registry",
+    # Trace adapter registry functions
+    "create_default_adapter_registry",
+    "get_default_adapter_registry",
+    "reset_default_adapter_registry",
 ]
diff --git a/backend/app/plugins/adapter_registry.py b/backend/app/plugins/adapter_registry.py
new file mode 100644
index 0000000..02c248c
--- /dev/null
+++ b/backend/app/plugins/adapter_registry.py
@@ -0,0 +1,92 @@
+"""Default registry configuration for trace adapter plugins.
+
+This module provides functions to create and manage the global
+default trace adapter registry.  Plugins are automatically discovered
+from modules in the ``adapters`` sub-package.
+"""
+
+from __future__ import annotations
+
+import importlib
+import inspect
+import pkgutil
+import threading
+from pathlib import Path
+
+from app.plugins.base import TraceAdapterPlugin, TraceAdapterRegistry
+
+
+def _discover_adapters() -> list[type[TraceAdapterPlugin]]:
+    """Discover all TraceAdapterPlugin subclasses in the adapters package.
+
+    Scans all modules in the ``plugins/adapters/`` directory and finds
+    concrete classes that inherit from TraceAdapterPlugin.
+
+    Returns:
+        A list of adapter plugin classes (not instances).
+    """
+    adapters: list[type[TraceAdapterPlugin]] = []
+    package_dir = Path(__file__).parent / "adapters"
+
+    if not package_dir.is_dir():
+        return adapters
+
+    for module_info in pkgutil.iter_modules([str(package_dir)]):
+        if module_info.name == "__init__":
+            continue
+
+        module = importlib.import_module(f"app.plugins.adapters.{module_info.name}")
+
+        for _name, obj in inspect.getmembers(module, inspect.isclass):
+            if (
+                issubclass(obj, TraceAdapterPlugin)
+                and obj is not TraceAdapterPlugin
+                and not inspect.isabstract(obj)
+            ):
+                adapters.append(obj)
+
+    return adapters
+
+
+def create_default_adapter_registry() -> TraceAdapterRegistry:
+    """Create a TraceAdapterRegistry with all discovered adapter plugins.
+
+    Returns:
+        A registry pre-populated with all discovered adapters.
+    """
+    registry = TraceAdapterRegistry()
+    for adapter_cls in _discover_adapters():
+        registry.register(adapter_cls)
+    return registry
+
+
+# Global default registry instance
+_default_adapter_registry: TraceAdapterRegistry | None = None
+_adapter_registry_lock = threading.Lock()
+
+
+def get_default_adapter_registry() -> TraceAdapterRegistry:
+    """Get the global default trace adapter registry.
+
+    Creates the registry on first access (lazy initialization).
+    Thread-safe using double-checked locking pattern.
+
+    Returns:
+        The global TraceAdapterRegistry instance.
+    """
+    global _default_adapter_registry
+    if _default_adapter_registry is None:
+        with _adapter_registry_lock:
+            if _default_adapter_registry is None:
+                _default_adapter_registry = create_default_adapter_registry()
+    assert _default_adapter_registry is not None
+    return _default_adapter_registry
+
+
+def reset_default_adapter_registry() -> None:
+    """Reset the global default adapter registry.
+
+    Primarily used for testing to ensure a clean state.
+    """
+    global _default_adapter_registry
+    _default_adapter_registry = None
diff --git a/backend/app/plugins/adapters/__init__.py b/backend/app/plugins/adapters/__init__.py
new file mode 100644
index 0000000..2ecf7dc
--- /dev/null
+++ b/backend/app/plugins/adapters/__init__.py
@@ -0,0 +1 @@
+"""Trace adapter plugin implementations (auto-discovered)."""
diff --git a/backend/app/plugins/adapters/trace_export.py b/backend/app/plugins/adapters/trace_export.py
new file mode 100644
index 0000000..82afaef
--- /dev/null
+++ b/backend/app/plugins/adapters/trace_export.py
@@ -0,0 +1,317 @@
+from __future__ import annotations
+
+import ast
+import json
+import shlex
+from collections.abc import Mapping
+from datetime import datetime, timezone
+from typing import Any
+from uuid import UUID
+
+from app.domain.enums import GroundTruthStatus
+from app.domain.models import (
+    AgenticGroundTruthEntry,
+    ContextEntry,
+    FeedbackEntry,
+    HistoryEntry,
+    ToolCallRecord,
+)
+from app.plugins.base import TraceAdapterPlugin
+
+
+def _clean_text(value: object) -> str:
+    if not isinstance(value, str):
+        return ""
+    return value.strip()
+
+
+def _slug(value: object) -> str:
+    cleaned = _clean_text(value).lower()
+    if not cleaned:
+        return ""
+    chars: list[str] = []
+    last_dash = False
+    for char in cleaned:
+        if char.isalnum():
+            chars.append(char)
+            last_dash = False
+            continue
+        if last_dash:
+            continue
+        chars.append("-")
+        last_dash = True
+    return "".join(chars).strip("-")
+
+
+def _coerce_datetime(trace: Mapping[str, Any]) -> datetime:
+    iso_value = trace.get("feedback_datetime_utc")
+    if isinstance(iso_value, str) and iso_value.strip():
+        return datetime.fromisoformat(iso_value.replace("Z", "+00:00"))
+
+    epoch_value = trace.get("feedback_date")
+    if isinstance(epoch_value, (int, float)):
+        return datetime.fromtimestamp(epoch_value, tz=timezone.utc)
+
+    return datetime.now(timezone.utc)
+
+
+def _parse_jsonish(value: object) -> Any:
+    if not isinstance(value, str):
+        return value
+
+    cleaned = value.strip()
+    if not cleaned:
+        return {}
+
+    if cleaned[0] in "{[":
+        try:
+            return json.loads(cleaned)
+        except json.JSONDecodeError:
+            return {"raw": cleaned}
+
+    return {"raw": cleaned}
+
+
+def _parse_tool_arguments(raw_arguments: object) -> dict[str, Any] | None:
+    if not isinstance(raw_arguments, str) or not raw_arguments.strip():
+        return None
+
+    try:
+        tokens = shlex.split(raw_arguments)
+    except ValueError:
+        return {"raw": raw_arguments}
+
+    parsed: dict[str, Any] = {}
+    for token in tokens:
+        if "=" not in token:
+            return {"raw": raw_arguments}
+        key, raw_value = token.split("=", 1)
+        key = key.strip()
+        if not key:
+            return {"raw": raw_arguments}
+        try:
+            parsed[key] = ast.literal_eval(raw_value)
+        except (ValueError, SyntaxError):
+            parsed[key] = raw_value
+    return parsed or {"raw": raw_arguments}
+
+
+def _build_tool_response(tool_call: Mapping[str, Any]) -> Any:
+    result = _parse_jsonish(tool_call.get("function_result"))
+    execution_time = tool_call.get("execution_time")
+    run_id = tool_call.get("run_id")
+
+    if execution_time is None and run_id is None:
+        return result
+
+    response: dict[str, Any] = {"result": result}
+    if execution_time is not None:
+        response["executionTimeSeconds"] = execution_time
+    if run_id is not None:
+        response["runId"] = run_id
+    return response
+
+
+def _build_history(chat_history: list[Mapping[str, Any]]) -> list[HistoryEntry]:
+    history: list[HistoryEntry] = []
+    for chat_item in chat_history:
+        user_query = _clean_text(chat_item.get("user_query"))
+        if user_query:
+            history.append(HistoryEntry(role="user", msg=user_query))
+
+        chat_response = _clean_text(chat_item.get("chat_response"))
+        if chat_response:
+            history.append(HistoryEntry(role="orchestrator-agent", msg=chat_response))
+
+        rca = _clean_text(chat_item.get("rca"))
+        if rca:
+            history.append(HistoryEntry(role="output-agent", msg=rca))
+    return history
+
+
+def _build_context_entries(trace: Mapping[str, Any], chat_history_count: int) -> list[ContextEntry]:
+    entries: list[ContextEntry] = []
+    for key in (
+        "uid",
+        "cid_list",
+        "impacted_device_type",
+        "impacted_device",
+        "metric_name",
+        "type",
+        "resolution",
+        "feedback_datetime_utc",
+    ):
+        value = trace.get(key)
+        if value in (None, "", []):
+            continue
+        entries.append(ContextEntry(key=key, value=value))
+
+    entries.append(ContextEntry(key="chat_history_count", value=chat_history_count))
+    return entries
+
+
+def _build_feedback(trace: Mapping[str, Any]) -> list[FeedbackEntry]:
+    additional_feedback = trace.get("additional_feedback")
+    summary_values = {
+        "metricName": trace.get("metric_name"),
+        "feedbackType": trace.get("type"),
+        "comment": trace.get("comment"),
+        "resolution": trace.get("resolution"),
+    }
+
+    feedback: list[FeedbackEntry] = [
+        FeedbackEntry(
+            source="trace-export-summary",
+            values={key: value for key, value in summary_values.items() if value not in (None, "")},
+        )
+    ]
+    if isinstance(additional_feedback, dict) and additional_feedback:
+        feedback.append(
+            FeedbackEntry(source="trace-export-ratings", values=dict(additional_feedback))
+        )
+    return feedback
+
+
+def _build_trace_ids(trace: Mapping[str, Any]) -> dict[str, str] | None:
+    trace_ids: dict[str, str] = {}
+    trace_id = _clean_text(trace.get("id"))
+    if trace_id:
+        trace_ids["traceId"] = trace_id
+
+    cid_list = trace.get("cid_list")
+    if isinstance(cid_list, list) and cid_list:
+        first_cid = cid_list[0]
+        if isinstance(first_cid, str) and first_cid.strip():
+            trace_ids["conversationId"] = first_cid.strip()
+
+    uid = _clean_text(trace.get("uid"))
+    if uid:
+        trace_ids["userId"] = uid
+
+    return trace_ids or None
+
+
+def _build_manual_tags(trace: Mapping[str, Any]) -> list[str]:
+    tags = [
+        "source:trace-export",
+        "workflow:agentic-rca",
+    ]
+    metric = _slug(trace.get("metric_name"))
+    feedback_type = _slug(trace.get("type"))
+    device_type = _slug(trace.get("impacted_device_type"))
+    if metric:
+        tags.append(f"metric:{metric}")
+    if feedback_type:
+        tags.append(f"feedback:{feedback_type}")
+    if device_type:
+        tags.append(f"device:{device_type}")
+    return sorted(set(tags))
+
+
+class TraceExportAdapter(TraceAdapterPlugin):
+    @property
+    def name(self) -> str:
+        return "trace-export"
+
+    def __init__(
+        self,
+        *,
+        dataset_name: str,
+        bucket: UUID | None = None,
+        status: GroundTruthStatus = GroundTruthStatus.draft,
+        created_by: str = "trace-export-adapter",
+    ) -> None:
+        self._dataset_name = dataset_name
+        self._bucket = bucket
+        self._status = status
+        self._created_by = created_by
+
+    def adapt_payload(
+        self, payload: Mapping[str, Any], **kwargs: Any
+    ) -> list[AgenticGroundTruthEntry]:
+        traces = payload.get("traces")
+        if not isinstance(traces, list):
+            raise ValueError("trace export payload must contain a 'traces' list")
+
+        items: list[AgenticGroundTruthEntry] = []
+        for index, trace in enumerate(traces, start=1):
+            if not isinstance(trace, Mapping):
+                raise ValueError(f"trace at index {index - 1} must be an object")
+            items.append(self.adapt_trace(trace, index=index))
+        return items
+
+    def adapt_trace(self, trace: Mapping[str, Any], *, index: int = 1) -> AgenticGroundTruthEntry:
+        trace_id = _clean_text(trace.get("id")) or f"trace-{index}"
+        created_at = _coerce_datetime(trace)
+
+        raw_chat_history = trace.get("chat_history")
+        if raw_chat_history is None:
+            chat_history: list[Mapping[str, Any]] = []
+        elif isinstance(raw_chat_history, list):
+            chat_history = [
+                chat_item for chat_item in raw_chat_history if isinstance(chat_item, Mapping)
+            ]
+        else:
+            raise ValueError(f"trace '{trace_id}' has a non-list chat_history value")
+
+        history = _build_history(chat_history)
+
+        tool_calls: list[ToolCallRecord] = []
+        for step_number, chat_item in enumerate(chat_history, start=1):
+            raw_context = chat_item.get("context")
+            if raw_context is None:
+                continue
+            if not isinstance(raw_context, list):
+                raise ValueError(
+                    f"trace '{trace_id}' chat_history[{step_number - 1}] context must be a list"
+                )
+            for raw_tool_call in raw_context:
+                if not isinstance(raw_tool_call, Mapping):
+                    raise ValueError(
+                        f"trace '{trace_id}' chat_history[{step_number - 1}] context entries must be objects"
+                    )
+                tool_calls.append(
+                    ToolCallRecord(
+                        id=_clean_text(raw_tool_call.get("id"))
+                        or f"{trace_id}:tool:{len(tool_calls) + 1}",
+                        name=_clean_text(raw_tool_call.get("function_name"))
+                        or f"tool-{len(tool_calls) + 1}",
+                        callType="tool",
+                        arguments=_parse_tool_arguments(raw_tool_call.get("function_arguments")),
+                        stepNumber=len(tool_calls) + 1,
+                        response=_build_tool_response(raw_tool_call),
+                    )
+                )
+
+        metadata = {
+            "sourceFormat": "trace-export",
+            "metricName": trace.get("metric_name"),
+            "feedbackType": trace.get("type"),
+            "chatHistoryCount": len(chat_history),
+            "toolCallCount": len(tool_calls),
+        }
+
+        return AgenticGroundTruthEntry.model_validate(
+            {
+                "id": f"trace-{trace_id}",
+                "datasetName": self._dataset_name,
+                "bucket": self._bucket,
+                "status": self._status,
+                "manualTags": _build_manual_tags(trace),
+                "comment": _clean_text(trace.get("comment"))
+                or _clean_text(trace.get("resolution")),
+                "updatedAt": created_at,
+                "createdAt": created_at,
+                "createdBy": self._created_by,
+                "scenarioId": f"trace-export:{trace_id}",
+                "history": history,
+                "contextEntries": _build_context_entries(trace, len(chat_history)),
+                "traceIds": _build_trace_ids(trace),
+                "toolCalls": tool_calls,
+                "feedback": _build_feedback(trace),
+                "metadata": {
+                    key: value for key, value in metadata.items() if value not in (None, "")
+                },
+                "tracePayload": dict(trace),
+            }
+        )
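
Reviewer note on `_parse_tool_arguments`: the helper tokenizes a `key=value` string with `shlex.split` and tries `ast.literal_eval` on each value, falling back to `{"raw": ...}` whenever the input does not fit that shape. The snippet below is a standalone copy for illustration only. The name `parse_tool_arguments` and the sample inputs are assumptions, not part of the change.

```python
from __future__ import annotations

import ast
import shlex
from typing import Any


def parse_tool_arguments(raw_arguments: object) -> dict[str, Any] | None:
    """Standalone copy of the adapter's key=value parser (illustrative)."""
    if not isinstance(raw_arguments, str) or not raw_arguments.strip():
        return None
    try:
        tokens = shlex.split(raw_arguments)
    except ValueError:
        # Unbalanced quotes: keep the original text rather than guess.
        return {"raw": raw_arguments}

    parsed: dict[str, Any] = {}
    for token in tokens:
        if "=" not in token:
            # A positional token means this is not a key=value list.
            return {"raw": raw_arguments}
        key, raw_value = token.split("=", 1)
        key = key.strip()
        if not key:
            return {"raw": raw_arguments}
        try:
            # Python literals (numbers, booleans, lists) become typed values.
            parsed[key] = ast.literal_eval(raw_value)
        except (ValueError, SyntaxError):
            # Anything else stays a plain string.
            parsed[key] = raw_value
    return parsed or {"raw": raw_arguments}


# Hypothetical inputs covering each branch.
examples = {
    "numbers": parse_tool_arguments("limit=5 ratio=0.5"),
    "quoted": parse_tool_arguments('device="core router" verbose=True'),
    "positional": parse_tool_arguments("just some words"),
    "unbalanced": parse_tool_arguments('name="oops'),
    "empty": parse_tool_arguments("   "),
}
```

Note that a single token without `=` sends the whole string to the `raw` fallback, so mixed positional and keyword arguments are never partially parsed.
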
diff --git a/backend/app/plugins/base.py b/backend/app/plugins/base.py
index d84dcb3..008f6f4 100644
--- a/backend/app/plugins/base.py
+++ b/backend/app/plugins/base.py
@@ -1,16 +1,87 @@
-"""Base classes for the computed tags plugin system.
-
-This module defines the abstract base class for computed tag plugins
-and the registry for managing them.
+"""Base classes for the plugin system.
+
+This module defines:
+- ComputedTagPlugin: abstract base for computed-tag plugins
+- TagPluginRegistry: registry for computed-tag plugins
+- PluginPack: abstract base for broader plugin packs (validators, explorer contributions)
+- PluginPackRegistry: startup-validated registry for plugin packs
+- ExplorerFieldDefinition / ImportTransform / ExportTransform: supporting types
+  for the PluginPack extension surfaces added in Phase 1.
 """
 
 from __future__ import annotations
 
 from abc import ABC, abstractmethod
-from typing import TYPE_CHECKING
+from collections.abc import Mapping
+from dataclasses import dataclass, field
+from typing import TYPE_CHECKING, Any, Callable
 
 if TYPE_CHECKING:
-    from app.domain.models import GroundTruthItem
+    from app.domain.models import AgenticGroundTruthEntry
+
+
+# ---------------------------------------------------------------------------
+# Supporting types for plugin-pack extension surfaces
+# ---------------------------------------------------------------------------
+
+
+@dataclass(frozen=True)
+class ExplorerFieldDefinition:
+    """Describes a plugin-contributed column or filter in the explorer view.
+
+    Attributes:
+        key: Stable identifier for the field (e.g. "rag-compat:refCount").
+        label: Human-readable column header.
+        field_type: One of "string", "number", "boolean", "date".
+        sortable: Whether the explorer should allow sorting on this field.
+        filterable: Whether the explorer should allow filtering on this field.
+        pack_name: Owning plugin pack name (auto-populated by registry).
+    """
+
+    key: str
+    label: str
+    field_type: str = "string"
+    sortable: bool = False
+    filterable: bool = False
+    pack_name: str = ""
+
+
+@dataclass(frozen=True)
+class ImportTransform:
+    """A named transformation applied to a record during import.
+
+    Attributes:
+        name: Unique identifier for this transform (e.g. "rag-compat:legacy-refs").
+        description: Human-readable explanation of what the transform does.
+        transform: Callable that receives a raw dict and returns a transformed dict.
+        pack_name: Owning plugin pack name (auto-populated by registry).
+    """
+
+    name: str
+    description: str = ""
+    transform: Callable[[dict[str, Any]], dict[str, Any]] = field(
+        default_factory=lambda: (lambda d: d)
+    )
+    pack_name: str = ""
+
+
+@dataclass(frozen=True)
+class ExportTransform:
+    """A named transformation applied to a record during export.
+
+    Attributes:
+        name: Unique identifier for this transform (e.g. "rag-compat:flatten-refs").
+        description: Human-readable explanation of what the transform does.
+        transform: Callable that receives a record dict and returns a transformed dict.
+        pack_name: Owning plugin pack name (auto-populated by registry).
+    """
+
+    name: str
+    description: str = ""
+    transform: Callable[[dict[str, Any]], dict[str, Any]] = field(
+        default_factory=lambda: (lambda d: d)
+    )
+    pack_name: str = ""
 
 
 class ComputedTagPlugin(ABC):
@@ -29,7 +100,7 @@ class LongDocPlugin(ComputedTagPlugin):
             def tag_key(self) -> str:
                 return "length:long"
 
-            def compute(self, doc: GroundTruthItem) -> str | None:
+            def compute(self, doc: AgenticGroundTruthEntry) -> str | None:
                 content = doc.answer or ""
                 return self.tag_key if len(content) > 10000 else None
 
@@ -39,7 +110,7 @@ class DatasetPlugin(ComputedTagPlugin):
             def tag_key(self) -> str:
                 return "dataset:_dynamic"
 
-            def compute(self, doc: GroundTruthItem) -> str | None:
+            def compute(self, doc: AgenticGroundTruthEntry) -> str | None:
                 return f"dataset:{doc.datasetName}" if doc.datasetName else None
     """
 
@@ -54,12 +125,13 @@ def tag_key(self) -> str:
         pass
 
     @abstractmethod
-    def compute(self, doc: GroundTruthItem) -> str | None:
+    def compute(self, doc: AgenticGroundTruthEntry) -> str | None:
         """Compute the tag for this document.
 
         Args:
-            doc: The GroundTruthItem to evaluate.
-                 Contains fields like 'answer', 'synthQuestion', 'refs', 'history', etc.
+            doc: The AgenticGroundTruthEntry to evaluate.
+                 Contains fields like 'answer', 'history', 'refs', etc.
+                 Legacy fields such as synthQuestion and editedQuestion are exposed as computed properties.
 
         Returns:
             The tag string if applicable, None otherwise.
@@ -107,14 +179,14 @@ def register(self, plugin: ComputedTagPlugin) -> None:
         self._registered_keys.add(plugin.tag_key)
         self._plugins.append(plugin)
 
-    def compute_all(self, doc: GroundTruthItem) -> list[str]:
+    def compute_all(self, doc: AgenticGroundTruthEntry) -> list[str]:
         """Compute all applicable tags for a document.
 
         Iterates through all registered plugins and collects tags
         from plugins whose compute() method returns a tag string.
 
         Args:
-            doc: The GroundTruthItem to evaluate.
+            doc: The AgenticGroundTruthEntry to evaluate.
 
         Returns:
             A list of computed tag keys that apply to this document.
@@ -215,3 +287,434 @@ def filter_manual_tags(
     def __len__(self) -> int:
         """Return the number of registered plugins."""
         return len(self._plugins)
+
+
+# ---------------------------------------------------------------------------
+# Plugin-pack contract (broader than computed tags)
+# ---------------------------------------------------------------------------
+
+
+class PluginPack(ABC):
+    """Abstract base class for plugin packs.
+
+    A plugin pack is a named unit of domain behavior that can contribute:
+    - Startup validation (via validate_registration)
+    - Approval-time error and waiver hooks (via collect_approval_errors, collect_approval_waivers)
+
+    The generic core hosts plugin packs without being aware of domain details.
+    Computed-tag plugins continue to work unchanged through TagPluginRegistry.
+
+    Example (minimal no-op pack)::
+
+        class MyDomainPack(PluginPack):
+            @property
+            def name(self) -> str:
+                return "my-domain"
+
+            def validate_registration(self) -> None:
+                # assert required config exists; raise ValueError on failure
+                pass
+
+    Example (approval-contributing pack)::
+
+        class StrictRefPack(PluginPack):
+            @property
+            def name(self) -> str:
+                return "strict-ref"
+
+            def collect_approval_errors(
+                self, item: AgenticGroundTruthEntry
+            ) -> list[str]:
+                errors: list[str] = []
+                if not item.refs:
+                    errors.append("strict-ref: at least one reference is required")
+                return errors
+    """
+
+    @property
+    @abstractmethod
+    def name(self) -> str:
+        """Unique identifier for this plugin pack (e.g., 'rag-compat').
+
+        Used for duplicate-registration detection and telemetry.
+        Must be a non-empty string stable across restarts.
+        """
+
+    def validate_registration(self) -> None:
+        """Validate this pack's own registration contract at startup.
+
+        Called once during application startup by PluginPackRegistry.validate_all().
+        Raise ValueError with an actionable message if the pack is misconfigured.
+        The default implementation is a no-op.
+
+        Raises:
+            ValueError: If the pack is not correctly configured.
+        """
+
+    def collect_approval_errors(self, item: AgenticGroundTruthEntry) -> list[str]:
+        """Return pack-specific approval validation errors for an item.
+
+        Called after the generic core approval checks. Return an empty list
+        when the item is acceptable from this pack's perspective.
+
+        Args:
+            item: The item being evaluated for approval.
+
+        Returns:
+            A list of human-readable error messages, or an empty list on success.
+        """
+        return []
+
+    def collect_approval_waivers(
+        self, item: AgenticGroundTruthEntry, core_errors: list[str]
+    ) -> list[str]:
+        """Return core error messages that this pack waives for the given item.
+
+        Called after generic core approval checks produce ``core_errors``.
+        Return exact error strings from ``core_errors`` that this pack's
+        domain logic determines should be suppressed.
+
+        Args:
+            item: The item being evaluated for approval.
+            core_errors: Error messages produced by the generic core checks.
+
+        Returns:
+            A list of error strings to remove from core_errors, or empty list.
+        """
+        return []
+
+    # ------------------------------------------------------------------
+    # Extension surfaces (Phase 1 contract — default no-ops)
+    # ------------------------------------------------------------------
+
+    def get_stats_contribution(self, base_stats: dict[str, Any]) -> dict[str, Any]:
+        """Return plugin-specific stats to merge into the stats response.
+
+        Called by the stats endpoint to let each pack contribute domain-
+        specific aggregations alongside the generic core counts.
+
+        Args:
+            base_stats: The core stats dict already computed by the host.
+
+        Returns:
+            A dict of additional stats entries.  Keys must be namespaced
+            with the pack name (e.g. ``"rag-compat:refCount"``).
+        """
+        return {}
+
+    def get_explorer_fields(self) -> list[ExplorerFieldDefinition]:
+        """Return field definitions for plugin-contributed explorer columns/filters.
+
+        The host merges these into its own explorer column set so that
+        plugin-specific data can be browsed without hardcoding column
+        definitions in the core explorer.
+
+        Returns:
+            A list of field definitions.
+        """
+        return []
+
+    def get_import_transforms(self) -> list[ImportTransform]:
+        """Return transforms applied during record import.
+
+        Each transform receives the raw dict read from the import source
+        and must return a (possibly mutated) dict.  Transforms execute in
+        list order.
+
+        Returns:
+            A list of import transforms contributed by this pack.
+        """
+        return []
+
+    def get_export_transforms(self) -> list[ExportTransform]:
+        """Return transforms applied during record export.
+
+        Each transform receives the normalised record dict and must return
+        a (possibly mutated) dict suitable for the target export format.
+        Transforms execute in list order.
+
+        Returns:
+            A list of export transforms contributed by this pack.
+        """
+        return []
+
+
+class PluginPackRegistry:
+    """Registry for plugin packs with startup validation.
+
+    Maintains a named collection of PluginPack instances. Startup validation
+    calls each pack's validate_registration() method so misconfigured packs
+    fail fast with actionable errors rather than silently degrading behavior.
+
+    Example::
+
+        registry = PluginPackRegistry()
+        registry.register(RagCompatPack())
+        registry.validate_all()  # called once during app startup
+    """
+
+    def __init__(self) -> None:
+        self._packs: dict[str, PluginPack] = {}
+
+    def register(self, pack: PluginPack) -> None:
+        """Register a plugin pack.
+
+        Args:
+            pack: The plugin pack instance to register.
+
+        Raises:
+            ValueError: If a pack with the same name is already registered,
+                or if pack.name is empty.
+        """
+        pack_name = pack.name
+        if not pack_name or not pack_name.strip():
+            raise ValueError("Plugin pack name must be a non-empty string")
+        if pack_name in self._packs:
+            raise ValueError(
+                f"Duplicate plugin pack name '{pack_name}': "
+                "a pack with this name is already registered"
+            )
+        self._packs[pack_name] = pack
+
+    def validate_all(self) -> None:
+        """Run startup validation for all registered packs.
+
+        Calls validate_registration() on every registered pack. If any pack
+        raises, this method re-raises a ValueError with the pack name included
+        so the startup error is actionable.
+
+        Raises:
+            ValueError: If any pack's validate_registration() fails.
+        """
+        for name, pack in self._packs.items():
+            try:
+                pack.validate_registration()
+            except ValueError as exc:
+                raise ValueError(f"Plugin pack '{name}' failed startup validation: {exc}") from exc
+
+    def collect_approval_errors(self, item: AgenticGroundTruthEntry) -> list[str]:
+        """Gather approval errors from all registered packs.
+
+        Args:
+            item: The item being evaluated for approval.
+
+        Returns:
+            Combined list of approval error messages from all packs.
+        """
+        errors: list[str] = []
+        for pack in self._packs.values():
+            errors.extend(pack.collect_approval_errors(item))
+        return errors
+
+    def collect_approval_waivers(
+        self, item: AgenticGroundTruthEntry, core_errors: list[str]
+    ) -> list[str]:
+        """Gather core-error waivers from all registered packs.
+
+        Args:
+            item: The item being evaluated for approval.
+            core_errors: Error messages produced by the generic core checks.
+
+        Returns:
+            Combined list of core error strings that packs want suppressed.
+        """
+        waivers: list[str] = []
+        for pack in self._packs.values():
+            waivers.extend(pack.collect_approval_waivers(item, core_errors))
+        return waivers
+
+    def filter_core_errors(
+        self, item: AgenticGroundTruthEntry, core_errors: list[str]
+    ) -> list[str]:
+        """Apply pack waivers to core errors and return the filtered list.
+
+        Args:
+            item: The item being evaluated for approval.
+            core_errors: Error messages produced by the generic core checks.
+
+        Returns:
+            Core errors with waived entries removed.
+        """
+        waivers = set(self.collect_approval_waivers(item, core_errors))
+        return [e for e in core_errors if e not in waivers]
+
+    def get(self, name: str) -> PluginPack | None:
+        """Return the pack with the given name, or None if not registered."""
+        return self._packs.get(name)
+
+    def names(self) -> list[str]:
+        """Return sorted list of registered pack names."""
+        return sorted(self._packs.keys())
+
+    # ------------------------------------------------------------------
+    # Aggregation helpers for extension surfaces (Phase 1)
+    # ------------------------------------------------------------------
+
+    def collect_stats(self, base_stats: dict[str, Any]) -> dict[str, Any]:
+        """Aggregate stats contributions from all registered packs.
+
+        Args:
+            base_stats: Core stats dict already computed by the host.
+
+        Returns:
+            A merged dict containing base stats plus pack contributions.
+            Pack-contributed keys overwrite base keys on collision.
+        """
+        merged: dict[str, Any] = dict(base_stats)
+        for pack in self._packs.values():
+            merged.update(pack.get_stats_contribution(base_stats))
+        return merged
+
+    def collect_explorer_fields(self) -> list[ExplorerFieldDefinition]:
+        """Collect explorer field definitions from all registered packs.
+
+        Returns:
+            Combined list of field definitions with ``pack_name`` populated.
+        """
+        fields: list[ExplorerFieldDefinition] = []
+        for pack in self._packs.values():
+            for f in pack.get_explorer_fields():
+                populated = ExplorerFieldDefinition(
+                    key=f.key,
+                    label=f.label,
+                    field_type=f.field_type,
+                    sortable=f.sortable,
+                    filterable=f.filterable,
+                    pack_name=pack.name,
+                )
+                fields.append(populated)
+        return fields
+
+    def collect_import_transforms(self) -> list[ImportTransform]:
+        """Collect import transforms from all registered packs (in pack registration order).
+
+        Returns:
+            Combined list of import transforms with ``pack_name`` populated.
+        """
+        transforms: list[ImportTransform] = []
+        for pack in self._packs.values():
+            for t in pack.get_import_transforms():
+                populated = ImportTransform(
+                    name=t.name,
+                    description=t.description,
+                    transform=t.transform,
+                    pack_name=pack.name,
+                )
+                transforms.append(populated)
+        return transforms
+
+    def collect_export_transforms(self) -> list[ExportTransform]:
+        """Collect export transforms from all registered packs (in pack registration order).
+
+        Returns:
+            Combined list of export transforms with ``pack_name`` populated.
+        """
+        transforms: list[ExportTransform] = []
+        for pack in self._packs.values():
+            for t in pack.get_export_transforms():
+                populated = ExportTransform(
+                    name=t.name,
+                    description=t.description,
+                    transform=t.transform,
+                    pack_name=pack.name,
+                )
+                transforms.append(populated)
+        return transforms
+
+    def __len__(self) -> int:
+        """Return the number of registered packs."""
+        return len(self._packs)
+
+
+# ---------------------------------------------------------------------------
+# Trace adapter plugin system
+# ---------------------------------------------------------------------------
+
+
+class TraceAdapterPlugin(ABC):
+    """Abstract base class for trace-format adapter plugins.
+
+    Each plugin defines how a specific trace payload format is transformed
+    into ``AgenticGroundTruthEntry`` objects.  End users drop concrete
+    implementations into ``plugins/adapters/`` and they are auto-discovered
+    at startup.
+
+    Example::
+
+        class MyCustomAdapter(TraceAdapterPlugin):
+            @property
+            def name(self) -> str:
+                return "my-custom-format"
+
+            def adapt_payload(
+                self, payload: Mapping[str, Any], **kwargs: Any,
+            ) -> list[AgenticGroundTruthEntry]:
+                ...
+    """
+
+    @property
+    @abstractmethod
+    def name(self) -> str:
+        """Unique identifier for this adapter (e.g. ``'trace-export'``)."""
+
+    @abstractmethod
+    def adapt_payload(
+        self,
+        payload: Mapping[str, Any],
+        **kwargs: Any,
+    ) -> list[AgenticGroundTruthEntry]:
+        """Transform a raw payload into ground-truth entries.
+
+        Args:
+            payload: The raw trace payload (format is adapter-specific).
+            **kwargs: Additional adapter-specific configuration
+                      (e.g. ``dataset_name``, ``bucket``).
+
+        Returns:
+            A list of adapted ground-truth entries.
+        """
+
+
+class TraceAdapterRegistry:
+    """Registry for auto-discovered trace adapter plugins.
+
+    Provides lookup-by-name and enumeration of all registered adapters.
+    """
+
+    def __init__(self) -> None:
+        self._adapters: dict[str, type[TraceAdapterPlugin]] = {}
+
+    def register(self, adapter_cls: type[TraceAdapterPlugin]) -> None:
+        """Register an adapter plugin class.
+
+        Args:
+            adapter_cls: A *class* (not instance) extending TraceAdapterPlugin.
+
+        Raises:
+            ValueError: If the adapter name is empty or already registered.
+        """
+        # Allocate without calling __init__ (which may need arguments) to read the name property
+        instance = adapter_cls.__new__(adapter_cls)
+        name = instance.name
+        if not name:
+            raise ValueError("TraceAdapterPlugin.name must be non-empty")
+        if name in self._adapters:
+            raise ValueError(
+                f"Duplicate trace adapter name '{name}': "
+                f"{self._adapters[name].__name__} already registered"
+            )
+        self._adapters[name] = adapter_cls
+
+    def get(self, name: str) -> type[TraceAdapterPlugin] | None:
+        """Look up an adapter class by name."""
+        return self._adapters.get(name)
+
+    def names(self) -> list[str]:
+        """Return sorted list of all registered adapter names."""
+        return sorted(self._adapters)
+
+    def __len__(self) -> int:
+        return len(self._adapters)
+
+    def __contains__(self, name: str) -> bool:
+        return name in self._adapters
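
Reviewer note on `TraceAdapterRegistry.register`: it stores adapter classes and reads `name` through `adapter_cls.__new__(adapter_cls)`, which skips `__init__`. The sketch below shows what that implies. `Adapter`, `CsvAdapter`, `Incomplete`, and `read_name` are hypothetical stand-ins, not names from this change.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class Adapter(ABC):  # hypothetical stand-in for TraceAdapterPlugin
    @property
    @abstractmethod
    def name(self) -> str: ...


class CsvAdapter(Adapter):
    def __init__(self, *, dataset_name: str) -> None:
        self.dataset_name = dataset_name

    @property
    def name(self) -> str:
        # Safe to read without __init__ because it uses no instance state.
        return "csv"


class Incomplete(Adapter):
    """Forgets to implement `name`, so it is still abstract."""


def read_name(adapter_cls: type[Adapter]) -> str:
    # __new__ allocates the object without running __init__, so required
    # constructor arguments are not needed just to read the name.
    instance = adapter_cls.__new__(adapter_cls)
    return instance.name


# Register the class, then instantiate later with real arguments.
registry: dict[str, type[Adapter]] = {read_name(CsvAdapter): CsvAdapter}
adapter = registry["csv"](dataset_name="demo")

# A class with unimplemented abstract methods cannot even be allocated.
try:
    read_name(Incomplete)
    incomplete_error = None
except TypeError as exc:
    incomplete_error = type(exc).__name__
```

Two consequences seem worth keeping in mind: a `name` property that depends on attributes set in `__init__` would fail at registration with `AttributeError`, and an adapter that is still abstract surfaces as `TypeError` rather than the `ValueError` documented for `register`.
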
diff --git a/backend/app/plugins/computed_tags/__init__.py b/backend/app/plugins/computed_tags/__init__.py
index 8c6698f..7c9a519 100644
--- a/backend/app/plugins/computed_tags/__init__.py
+++ b/backend/app/plugins/computed_tags/__init__.py
@@ -11,13 +11,13 @@
 
 Example:
     from app.plugins.base import ComputedTagPlugin
-    from app.domain.models import GroundTruthItem
+    from app.domain.models import AgenticGroundTruthEntry
 
     class MyPlugin(ComputedTagPlugin):
         @property
         def tag_key(self) -> str:
             return "my_group:my_value"
 
-        def compute(self, doc: GroundTruthItem) -> bool:
-            return some_condition(doc)
+        def compute(self, doc: AgenticGroundTruthEntry) -> str | None:
+            return self.tag_key if some_condition(doc) else None
 """
diff --git a/backend/app/plugins/computed_tags/dataset.py b/backend/app/plugins/computed_tags/dataset.py
index 394097a..c2b7940 100644
--- a/backend/app/plugins/computed_tags/dataset.py
+++ b/backend/app/plugins/computed_tags/dataset.py
@@ -10,7 +10,7 @@
 from app.plugins.base import ComputedTagPlugin
 
 if TYPE_CHECKING:
-    from app.domain.models import GroundTruthItem
+    from app.domain.models import AgenticGroundTruthEntry
 
 
 class DatasetPlugin(ComputedTagPlugin):
@@ -28,5 +28,5 @@ class DatasetPlugin(ComputedTagPlugin):
     def tag_key(self) -> str:
         return "dataset:_dynamic"
 
-    def compute(self, doc: GroundTruthItem) -> str | None:
+    def compute(self, doc: AgenticGroundTruthEntry) -> str | None:
         return f"dataset:{doc.datasetName}" if doc.datasetName else None
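
Reviewer note on the computed-tag contract: after this change `compute()` returns the tag string or `None`, not a boolean, which lets dynamic plugins such as `DatasetPlugin` emit a tag different from their `tag_key`. The sketch below is minimal and self-contained. `Doc`, `TagPlugin`, and this `compute_all` are simplified assumptions and do not reproduce `TagPluginRegistry`.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Doc:  # hypothetical stand-in for AgenticGroundTruthEntry
    answer: str | None = None
    datasetName: str | None = None


class TagPlugin(ABC):  # simplified stand-in for ComputedTagPlugin
    @property
    @abstractmethod
    def tag_key(self) -> str: ...

    @abstractmethod
    def compute(self, doc: Doc) -> str | None: ...


class NoAnswer(TagPlugin):
    @property
    def tag_key(self) -> str:
        return "answer:no_answer"

    def compute(self, doc: Doc) -> str | None:
        # Static plugin: returns its own key when the condition holds.
        if doc.answer and doc.answer.strip().casefold() == "no_answer":
            return self.tag_key
        return None


class Dataset(TagPlugin):
    @property
    def tag_key(self) -> str:
        return "dataset:_dynamic"

    def compute(self, doc: Doc) -> str | None:
        # Dynamic plugin: the emitted tag depends on the document.
        return f"dataset:{doc.datasetName}" if doc.datasetName else None


def compute_all(plugins: list[TagPlugin], doc: Doc) -> list[str]:
    # Keep only the tags that plugins actually produced.
    return [tag for p in plugins if (tag := p.compute(doc)) is not None]


tags = compute_all([NoAnswer(), Dataset()], Doc(answer=" No_Answer ", datasetName="faq"))
untagged = compute_all([NoAnswer(), Dataset()], Doc(answer="42"))
```
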
diff --git a/backend/app/plugins/computed_tags/no_answer.py b/backend/app/plugins/computed_tags/no_answer.py
index 6ded997..5fdd3ea 100644
--- a/backend/app/plugins/computed_tags/no_answer.py
+++ b/backend/app/plugins/computed_tags/no_answer.py
@@ -11,7 +11,7 @@
 from app.plugins.base import ComputedTagPlugin
 
 if TYPE_CHECKING:
-    from app.domain.models import GroundTruthItem
+    from app.domain.models import AgenticGroundTruthEntry
 
 
 class NoAnswerPlugin(ComputedTagPlugin):
@@ -29,7 +29,7 @@ class NoAnswerPlugin(ComputedTagPlugin):
     def tag_key(self) -> str:
         return "answer:no_answer"
 
-    def compute(self, doc: GroundTruthItem) -> str | None:
+    def compute(self, doc: AgenticGroundTruthEntry) -> str | None:
         if doc.answer and doc.answer.strip().casefold() == "no_answer":
             return self.tag_key
         return None
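Review note: across these plugin files, `compute` returns the tag string when the tag applies and `None` otherwise. The following is a minimal, self-contained sketch of that contract. `Doc`, `TagPlugin`, and `collect_tags` are hypothetical stand-ins introduced for illustration; they are not the project's `AgenticGroundTruthEntry`, `ComputedTagPlugin`, or registry code.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Doc:
    """Hypothetical stand-in for the real entry model."""

    answer: Optional[str] = None


class TagPlugin(ABC):
    """Hypothetical stand-in for ComputedTagPlugin."""

    @property
    @abstractmethod
    def tag_key(self) -> str: ...

    @abstractmethod
    def compute(self, doc: Doc) -> Optional[str]: ...


class NoAnswer(TagPlugin):
    @property
    def tag_key(self) -> str:
        return "answer:no_answer"

    def compute(self, doc: Doc) -> Optional[str]:
        # Return the tag string when it applies, otherwise None.
        if doc.answer and doc.answer.strip().casefold() == "no_answer":
            return self.tag_key
        return None


def collect_tags(doc: Doc, plugins: List[TagPlugin]) -> List[str]:
    # Keep only the non-None results, sorted for stable output.
    results = (plugin.compute(doc) for plugin in plugins)
    return sorted(tag for tag in results if tag is not None)
```

Returning the tag string rather than a `bool` lets a plugin such as `DatasetPlugin` emit a value that differs from its static `tag_key`.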
diff --git a/backend/app/plugins/computed_tags/question_length.py b/backend/app/plugins/computed_tags/question_length.py
index abd888a..7b09359 100644
--- a/backend/app/plugins/computed_tags/question_length.py
+++ b/backend/app/plugins/computed_tags/question_length.py
@@ -1,7 +1,7 @@
 """Computed tag plugins for question length classification.
 
 This module provides plugins that tag documents based on the word count
-of the question (synthQuestion or editedQuestion).
+of the question (computed via the question property accessor).
 
 Word count thresholds:
 - short: SHORT_MAX_WORDS words or fewer
@@ -16,7 +16,7 @@
 from app.plugins.base import ComputedTagPlugin
 
 if TYPE_CHECKING:
-    from app.domain.models import GroundTruthItem
+    from app.domain.models import AgenticGroundTruthEntry
 
 # Word count thresholds for question length classification
 SHORT_MAX_WORDS = 10  # Questions with this many words or fewer are "short"
@@ -25,14 +25,14 @@
 )
 
 
-def _get_question_word_count(doc: GroundTruthItem) -> int:
+def _get_question_word_count(doc: AgenticGroundTruthEntry) -> int:
     """Get the word count for the document's question.
 
-    Uses editedQuestion if available, otherwise synthQuestion.
-    Uses .split() to count words as specified in requirements.
+    Uses the computed property accessor, which returns editedQuestion if available
+    and synthQuestion otherwise. Uses .split() to count words as specified in requirements.
 
     Args:
-        doc: The GroundTruthItem to evaluate.
+        doc: The AgenticGroundTruthEntry to evaluate.
 
     Returns:
         The number of words in the question.
@@ -51,7 +51,7 @@ class QuestionLengthLongPlugin(ComputedTagPlugin):
     def tag_key(self) -> str:
         return "question_length:long"
 
-    def compute(self, doc: GroundTruthItem) -> str | None:
+    def compute(self, doc: AgenticGroundTruthEntry) -> str | None:
         return self.tag_key if _get_question_word_count(doc) > MEDIUM_MAX_WORDS else None
 
 
@@ -65,7 +65,7 @@ class QuestionLengthMediumPlugin(ComputedTagPlugin):
     def tag_key(self) -> str:
         return "question_length:medium"
 
-    def compute(self, doc: GroundTruthItem) -> str | None:
+    def compute(self, doc: AgenticGroundTruthEntry) -> str | None:
         count = _get_question_word_count(doc)
         return self.tag_key if SHORT_MAX_WORDS < count <= MEDIUM_MAX_WORDS else None
 
@@ -80,5 +80,5 @@ class QuestionLengthShortPlugin(ComputedTagPlugin):
     def tag_key(self) -> str:
         return "question_length:short"
 
-    def compute(self, doc: GroundTruthItem) -> str | None:
+    def compute(self, doc: AgenticGroundTruthEntry) -> str | None:
         return self.tag_key if _get_question_word_count(doc) <= SHORT_MAX_WORDS else None
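Review note: the three question-length plugins partition word counts into non-overlapping buckets. A minimal sketch of the combined rule, written as one function, is below. `classify_question_length` is a hypothetical helper, and the `MEDIUM_MAX_WORDS` value is a placeholder because the real constant is not visible in this hunk.

```python
SHORT_MAX_WORDS = 10  # value shown in the diff
MEDIUM_MAX_WORDS = 20  # placeholder assumption; the real value is not shown here


def classify_question_length(question: str) -> str:
    """Return the single length tag that applies to a question."""
    count = len(question.split())  # whitespace-separated words
    if count <= SHORT_MAX_WORDS:
        return "question_length:short"
    if count <= MEDIUM_MAX_WORDS:
        return "question_length:medium"
    return "question_length:long"
```

With these comparisons every count falls into exactly one bucket, which matches the `<=`, `<`/`<=`, and `>` checks used by the three plugins.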
diff --git a/backend/app/plugins/computed_tags/reference_type.py b/backend/app/plugins/computed_tags/reference_type.py
index 904114d..c22cdf6 100644
--- a/backend/app/plugins/computed_tags/reference_type.py
+++ b/backend/app/plugins/computed_tags/reference_type.py
@@ -18,7 +18,7 @@
 from app.plugins.base import ComputedTagPlugin
 
 if TYPE_CHECKING:
-    from app.domain.models import GroundTruthItem, Reference
+    from app.domain.models import AgenticGroundTruthEntry, Reference
 
 # Pattern for article references: CS followed by digits (e.g., CS431120)
 _ARTICLE_PATTERN = re.compile(r"CS\d+", re.IGNORECASE)
@@ -52,31 +52,34 @@ def _is_helpcenter_url(url: str) -> bool:
     return "/help" in url.lower()
 
 
-def _get_all_references(doc: GroundTruthItem) -> list[Reference]:
+def _get_all_references(doc: AgenticGroundTruthEntry) -> list[Reference]:
     """Get all references from a document, including those in history turns.
 
     Args:
-        doc: The GroundTruthItem to evaluate.
+        doc: The AgenticGroundTruthEntry to evaluate.
 
     Returns:
         A list of all Reference objects from the document.
     """
+    from app.domain.models import HistoryItem
+
     refs: list[Reference] = list(doc.refs or [])
 
     # Also gather refs from history turns
+    # Only HistoryItem (a subclass of HistoryEntry) carries a refs field.
     if doc.history:
         for turn in doc.history:
-            if turn.refs:
+            if isinstance(turn, HistoryItem) and turn.refs:
                 refs.extend(turn.refs)
 
     return refs
 
 
-def _has_article_reference(doc: GroundTruthItem) -> bool:
+def _has_article_reference(doc: AgenticGroundTruthEntry) -> bool:
     """Check if document has at least one article reference.
 
     Args:
-        doc: The GroundTruthItem to evaluate.
+        doc: The AgenticGroundTruthEntry to evaluate.
 
     Returns:
         True if at least one reference URL matches the article pattern.
@@ -85,11 +88,11 @@ def _has_article_reference(doc: GroundTruthItem) -> bool:
     return any(_is_article_url(ref.url) for ref in refs)
 
 
-def _has_helpcenter_reference(doc: GroundTruthItem) -> bool:
+def _has_helpcenter_reference(doc: AgenticGroundTruthEntry) -> bool:
     """Check if document has at least one helpcenter reference.
 
     Args:
-        doc: The GroundTruthItem to evaluate.
+        doc: The AgenticGroundTruthEntry to evaluate.
 
     Returns:
         True if at least one reference URL contains '/help'.
@@ -109,7 +112,7 @@ class ReferenceTypeArticlePlugin(ComputedTagPlugin):
     def tag_key(self) -> str:
         return "reference_type:article"
 
-    def compute(self, doc: GroundTruthItem) -> str | None:
+    def compute(self, doc: AgenticGroundTruthEntry) -> str | None:
         return self.tag_key if _has_article_reference(doc) else None
 
 
@@ -124,5 +127,5 @@ class ReferenceTypeHelpcenterPlugin(ComputedTagPlugin):
     def tag_key(self) -> str:
         return "reference_type:helpcenter"
 
-    def compute(self, doc: GroundTruthItem) -> str | None:
+    def compute(self, doc: AgenticGroundTruthEntry) -> str | None:
         return self.tag_key if _has_helpcenter_reference(doc) else None
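Review note: the reference-type tags depend only on reference URLs. The sketch below reproduces the two checks over a plain list of URLs. The helper names are hypothetical, and the use of `search` (match anywhere in the URL) for the article pattern is an assumption, since the body of `_is_article_url` is not part of this hunk.

```python
import re
from typing import Iterable, List

ARTICLE_PATTERN = re.compile(r"CS\d+", re.IGNORECASE)  # same pattern as the diff


def is_article_url(url: str) -> bool:
    # Assumption: the pattern may match anywhere in the URL.
    return ARTICLE_PATTERN.search(url) is not None


def is_helpcenter_url(url: str) -> bool:
    # Mirrors the '/help' substring check shown in the diff.
    return "/help" in url.lower()


def reference_type_tags(urls: Iterable[str]) -> List[str]:
    """Return the reference-type tags that apply to a collection of URLs."""
    url_list = list(urls)
    tags: List[str] = []
    if any(is_article_url(u) for u in url_list):
        tags.append("reference_type:article")
    if any(is_helpcenter_url(u) for u in url_list):
        tags.append("reference_type:helpcenter")
    return tags
```

Unlike the length buckets, these two tags are not mutually exclusive: an item can carry both.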
diff --git a/backend/app/plugins/computed_tags/retrieval_behavior.py b/backend/app/plugins/computed_tags/retrieval_behavior.py
index 5096adf..e2a62e1 100644
--- a/backend/app/plugins/computed_tags/retrieval_behavior.py
+++ b/backend/app/plugins/computed_tags/retrieval_behavior.py
@@ -15,17 +15,17 @@
 from app.plugins.base import ComputedTagPlugin
 
 if TYPE_CHECKING:
-    from app.domain.models import GroundTruthItem
+    from app.domain.models import AgenticGroundTruthEntry
 
 
-def _get_total_reference_count(doc: GroundTruthItem) -> int:
+def _get_total_reference_count(doc: AgenticGroundTruthEntry) -> int:
     """Get the total count of references from a document.
 
     Uses the totalReferences computed field which counts refs at item level
     and across all history turns.
 
     Args:
-        doc: The GroundTruthItem to evaluate.
+        doc: The AgenticGroundTruthEntry to evaluate.
 
     Returns:
         The total number of references.
@@ -43,7 +43,7 @@ class RetrievalBehaviorNoRefsPlugin(ComputedTagPlugin):
     def tag_key(self) -> str:
         return "retrieval_behavior:no_refs"
 
-    def compute(self, doc: GroundTruthItem) -> str | None:
+    def compute(self, doc: AgenticGroundTruthEntry) -> str | None:
         return self.tag_key if _get_total_reference_count(doc) == 0 else None
 
 
@@ -57,7 +57,7 @@ class RetrievalBehaviorSinglePlugin(ComputedTagPlugin):
     def tag_key(self) -> str:
         return "retrieval_behavior:single"
 
-    def compute(self, doc: GroundTruthItem) -> str | None:
+    def compute(self, doc: AgenticGroundTruthEntry) -> str | None:
         return self.tag_key if _get_total_reference_count(doc) == 1 else None
 
 
@@ -71,7 +71,7 @@ class RetrievalBehaviorTwoRefsPlugin(ComputedTagPlugin):
     def tag_key(self) -> str:
         return "retrieval_behavior:two_refs"
 
-    def compute(self, doc: GroundTruthItem) -> str | None:
+    def compute(self, doc: AgenticGroundTruthEntry) -> str | None:
         return self.tag_key if _get_total_reference_count(doc) == 2 else None
 
 
@@ -85,5 +85,5 @@ class RetrievalBehaviorRichPlugin(ComputedTagPlugin):
     def tag_key(self) -> str:
         return "retrieval_behavior:rich"
 
-    def compute(self, doc: GroundTruthItem) -> str | None:
+    def compute(self, doc: AgenticGroundTruthEntry) -> str | None:
         return self.tag_key if _get_total_reference_count(doc) >= 3 else None
diff --git a/backend/app/plugins/computed_tags/turns.py b/backend/app/plugins/computed_tags/turns.py
index 20807f4..166bc05 100644
--- a/backend/app/plugins/computed_tags/turns.py
+++ b/backend/app/plugins/computed_tags/turns.py
@@ -11,7 +11,7 @@
 from app.plugins.base import ComputedTagPlugin
 
 if TYPE_CHECKING:
-    from app.domain.models import GroundTruthItem
+    from app.domain.models import AgenticGroundTruthEntry
 
 
 class MultiTurnPlugin(ComputedTagPlugin):
@@ -25,7 +25,7 @@ class MultiTurnPlugin(ComputedTagPlugin):
     def tag_key(self) -> str:
         return "turns:multiturn"
 
-    def compute(self, doc: GroundTruthItem) -> str | None:
+    def compute(self, doc: AgenticGroundTruthEntry) -> str | None:
         history = doc.history
         if not history or not isinstance(history, list):
             return None
@@ -43,7 +43,7 @@ class SingleTurnPlugin(ComputedTagPlugin):
     def tag_key(self) -> str:
         return "turns:singleturn"
 
-    def compute(self, doc: GroundTruthItem) -> str | None:
+    def compute(self, doc: AgenticGroundTruthEntry) -> str | None:
         history = doc.history
         if not history or not isinstance(history, list):
             return self.tag_key
diff --git a/backend/app/plugins/pack_registry.py b/backend/app/plugins/pack_registry.py
new file mode 100644
index 0000000..f95ff3f
--- /dev/null
+++ b/backend/app/plugins/pack_registry.py
@@ -0,0 +1,91 @@
+"""Global default plugin-pack registry.
+
+Provides a lazily initialized singleton PluginPackRegistry that holds all
+built-in plugin packs.  The registry is validated at startup by the Container;
+if any pack's validate_registration() raises, the app will not start.
+
+Usage::
+
+    from app.plugins.pack_registry import get_default_pack_registry
+
+    # During startup:
+    get_default_pack_registry().validate_all()
+
+    # During approval:
+    errors = get_default_pack_registry().collect_approval_errors(item)
+"""
+
+from __future__ import annotations
+
+import logging
+import threading
+
+from app.plugins.base import PluginPackRegistry
+
+logger = logging.getLogger(__name__)
+
+# ---------------------------------------------------------------------------
+# Global singleton — thread-safe double-checked locking pattern
+# (mirrors registry.py which owns the TagPluginRegistry singleton)
+# ---------------------------------------------------------------------------
+
+_default_pack_registry: PluginPackRegistry | None = None
+_pack_registry_lock = threading.Lock()
+
+
+def create_default_pack_registry() -> PluginPackRegistry:
+    """Create a PluginPackRegistry pre-populated with all built-in packs.
+
+    Returns:
+        A registry containing the RagCompatPack.
+    """
+    from app.plugins.packs.rag_compat import RagCompatPack
+
+    registry = PluginPackRegistry()
+    registry.register(RagCompatPack())
+    logger.debug("plugin_pack_registry.created | packs=%s", registry.names())
+    return registry
+
+
+def get_default_pack_registry() -> PluginPackRegistry:
+    """Return the global default plugin-pack registry.
+
+    Creates the registry on first access (lazy initialization).
+    Thread-safe using double-checked locking.
+
+    Returns:
+        The global PluginPackRegistry instance.
+    """
+    global _default_pack_registry
+    if _default_pack_registry is None:
+        with _pack_registry_lock:
+            if _default_pack_registry is None:
+                _default_pack_registry = create_default_pack_registry()
+    assert _default_pack_registry is not None
+    return _default_pack_registry
+
+
+def reset_default_pack_registry() -> None:
+    """Reset the global default pack registry.
+
+    Primarily used in tests to ensure a clean state between test cases.
+    """
+    global _default_pack_registry
+    _default_pack_registry = None
+
+
+def get_required_pack(name: str, registry: PluginPackRegistry | None = None):
+    active_registry = registry if registry is not None else get_default_pack_registry()
+    pack = active_registry.get(name)
+    if pack is None:
+        raise LookupError(f"Required plugin pack '{name}' is not registered")
+    return pack
+
+
+def get_rag_compat_pack(registry: PluginPackRegistry | None = None):
+    from app.plugins.packs.rag_compat import RagCompatPack
+
+    pack = get_required_pack("rag-compat", registry)
+    if not isinstance(pack, RagCompatPack):
+        raise TypeError("Registered 'rag-compat' pack is not a RagCompatPack instance")
+    return pack
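Review note: `get_default_pack_registry` uses double-checked locking. A minimal, self-contained sketch of the pattern follows. `Registry`, `get_registry`, and `reset_registry` are hypothetical stand-ins, not the project's `PluginPackRegistry` API.

```python
import threading
from typing import Optional


class Registry:
    """Hypothetical stand-in for PluginPackRegistry."""

    def __init__(self) -> None:
        self.packs = {}


_registry: Optional[Registry] = None
_lock = threading.Lock()


def get_registry() -> Registry:
    """Return the shared registry, creating it on first use."""
    global _registry
    if _registry is None:  # fast path: no lock once initialized
        with _lock:
            if _registry is None:  # re-check: another thread may have won
                _registry = Registry()
    return _registry


def reset_registry() -> None:
    """Drop the shared instance so the next call rebuilds it."""
    global _registry
    with _lock:
        _registry = None
```

The sketch takes the lock in `reset_registry` as well. The diff's `reset_default_pack_registry` assigns without it, which is likely fine for single-threaded test teardown but could race with a concurrent first access.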
diff --git a/backend/app/plugins/packs/__init__.py b/backend/app/plugins/packs/__init__.py
new file mode 100644
index 0000000..1be114b
--- /dev/null
+++ b/backend/app/plugins/packs/__init__.py
@@ -0,0 +1,9 @@
+"""Plugin packs package.
+
+This package contains plugin-pack implementations that contribute domain-specific
+behavior (approval validation, explorer summaries, startup validation) on top of
+the generic agentic core.
+
+Built-in packs:
+    - rag_compat: RAG compatibility pack, owns retrieval-specific behavior.
+"""
diff --git a/backend/app/plugins/packs/rag_compat.py b/backend/app/plugins/packs/rag_compat.py
new file mode 100644
index 0000000..fd712bd
--- /dev/null
+++ b/backend/app/plugins/packs/rag_compat.py
@@ -0,0 +1,627 @@
+"""RAG compatibility pack.
+
+This pack owns retrieval-specific behavior on the generic agentic host:
+- Validates its own plugin-kind constant at startup so mismatches are detected
+  before any data is processed.
+- Projects per-item RAG state from ``plugins["rag-compat"].data`` via the
+  compat-accessor helpers already present on AgenticGroundTruthEntry.
+- Provides the canonical ``rag_compat_data``, ``refs_from_item``,
+  ``attach_reference``, and ``detach_reference`` helpers so reference
+  manipulation stays in one owned location rather than being inlined across
+  multiple services.
+- Contributes approval validation hooks that enforce RAG-specific invariants on
+  top of the generic core checks.
+
+Retrieval search remains available through the standard ``/v1/search`` endpoint
+(backed by SearchService), which handles the generic query path independently.
+Reference selection and attachment are owned by this pack.
+"""
+
+from __future__ import annotations
+
+import logging
+from typing import TYPE_CHECKING, Any
+
+from app.plugins.base import ExplorerFieldDefinition, ExportTransform, PluginPack
+
+if TYPE_CHECKING:
+    from app.domain.models import AgenticGroundTruthEntry, Reference
+
+logger = logging.getLogger(__name__)
+
+# The plugin-kind key stored inside AgenticGroundTruthEntry.plugins.
+# This MUST match AgenticGroundTruthEntry._RAG_COMPAT_PLUGIN.
+# validate_registration() enforces this at startup.
+_RAG_COMPAT_KIND: str = "rag-compat"
+
+_LEGACY_PLUGIN_FIELDS: tuple[str, ...] = (
+    "synthQuestion",
+    "editedQuestion",
+    "answer",
+    "refs",
+    "contextUsedForGeneration",
+    "contextSource",
+    "modelUsedForGeneration",
+    "semanticClusterNumber",
+    "weight",
+    "samplingBucket",
+    "questionLength",
+    "totalReferences",
+)
+
+_LEGACY_PLUGIN_FIELD_ALIASES: dict[str, str] = {
+    "synth_question": "synthQuestion",
+    "edited_question": "editedQuestion",
+    "context_used_for_generation": "contextUsedForGeneration",
+    "context_source": "contextSource",
+    "model_used_for_generation": "modelUsedForGeneration",
+    "semantic_cluster_number": "semanticClusterNumber",
+    "sampling_bucket": "samplingBucket",
+    "question_length": "questionLength",
+    "total_references": "totalReferences",
+}
+
+
+def _coerce_reference_list(raw_refs: Any) -> list[Any]:
+    if not isinstance(raw_refs, list):
+        return []
+
+    from app.domain.models import Reference
+
+    return [
+        ref if isinstance(ref, Reference) else Reference.model_validate(ref) for ref in raw_refs
+    ]
+
+
+def _history_message(history: Any, role: str, *, reverse: bool = False) -> str | None:
+    if not isinstance(history, list):
+        return None
+    iterator = reversed(history) if reverse else history
+    for turn in iterator:
+        if hasattr(turn, "role") and hasattr(turn, "msg"):
+            current_role = str(getattr(turn, "role", "")).strip().lower()
+            current_msg = str(getattr(turn, "msg", "")).strip()
+        elif isinstance(turn, dict):
+            current_role = str(turn.get("role", "")).strip().lower()
+            current_msg = str(turn.get("msg") or turn.get("content") or "").strip()
+        else:
+            continue
+        if current_role == role and current_msg:
+            return current_msg
+    return None
+
+
+def rag_compat_data_from_payload(
+    payload: dict[str, Any], *, plugin_name: str = _RAG_COMPAT_KIND
+) -> dict[str, Any]:
+    plugins = payload.get("plugins")
+    if not isinstance(plugins, dict):
+        return {}
+    plugin = plugins.get(plugin_name)
+    if hasattr(plugin, "data"):
+        plugin_data = getattr(plugin, "data", None)
+        return dict(plugin_data) if isinstance(plugin_data, dict) else {}
+    if isinstance(plugin, dict):
+        plugin_data = plugin.get("data")
+        return dict(plugin_data) if isinstance(plugin_data, dict) else {}
+    return {}
+
+
+def normalize_legacy_payload_for_core_model(
+    value: object, *, plugin_name: str = _RAG_COMPAT_KIND
+) -> object:
+    if not isinstance(value, dict):
+        return value
+
+    data = dict(value)
+    data.pop("tags", None)
+    legacy_payload: dict[str, Any] = {}
+
+    for alias, canonical in _LEGACY_PLUGIN_FIELD_ALIASES.items():
+        if alias not in data:
+            continue
+        alias_value = data.pop(alias)
+        if canonical not in data:
+            data[canonical] = alias_value
+
+    for field_name in _LEGACY_PLUGIN_FIELDS:
+        if field_name in data:
+            legacy_payload[field_name] = data.pop(field_name)
+
+    if "refs" in legacy_payload:
+        legacy_payload["refs"] = _coerce_reference_list(legacy_payload["refs"])
+
+    history_value = data.get("history")
+    if isinstance(history_value, list):
+        normalized_history: list[dict[str, Any]] = []
+        history_annotations: list[dict[str, Any]] = []
+        saw_history_annotations = False
+        for raw_entry in history_value:
+            if hasattr(raw_entry, "model_dump"):
+                entry_dict = raw_entry.model_dump(by_alias=True, exclude_none=True)
+            elif isinstance(raw_entry, dict):
+                entry_dict = dict(raw_entry)
+            else:
+                normalized_history.append(raw_entry)
+                history_annotations.append({})
+                continue
+
+            annotation: dict[str, Any] = {}
+            if "refs" in entry_dict:
+                annotation["refs"] = _coerce_reference_list(entry_dict.pop("refs"))
+                saw_history_annotations = True
+            expected_behavior = entry_dict.pop(
+                "expectedBehavior", entry_dict.pop("expected_behavior", None)
+            )
+            if expected_behavior is not None:
+                annotation["expectedBehavior"] = expected_behavior
+                saw_history_annotations = True
+
+            message = entry_dict.get("msg")
+            if message is None and "content" in entry_dict:
+                message = entry_dict.pop("content")
+            normalized_history.append(
+                {
+                    "role": entry_dict.get("role", ""),
+                    "msg": message or "",
+                }
+            )
+            history_annotations.append(annotation)
+
+        data["history"] = normalized_history
+        if saw_history_annotations:
+            legacy_payload["historyAnnotations"] = history_annotations
+    elif history_value is None and (
+        legacy_payload.get("editedQuestion")
+        or legacy_payload.get("synthQuestion")
+        or legacy_payload.get("answer")
+    ):
+        generated_history: list[dict[str, Any]] = []
+        question_text = legacy_payload.get("editedQuestion") or legacy_payload.get("synthQuestion")
+        if question_text:
+            generated_history.append({"role": "user", "msg": question_text})
+        if legacy_payload.get("answer"):
+            generated_history.append({"role": "assistant", "msg": legacy_payload["answer"]})
+        data["history"] = generated_history
+
+    if not legacy_payload:
+        return data
+
+    plugins_payload = dict(data.get("plugins") or {})
+    existing_plugin = plugins_payload.get(plugin_name)
+    if hasattr(existing_plugin, "model_dump"):
+        plugin_dict = existing_plugin.model_dump(by_alias=True)
+    elif isinstance(existing_plugin, dict):
+        plugin_dict = dict(existing_plugin)
+    else:
+        plugin_dict = {"kind": plugin_name, "version": "1.0", "data": {}}
+    plugin_data_raw = plugin_dict.get("data")
+    plugin_data = dict(plugin_data_raw) if isinstance(plugin_data_raw, dict) else {}
+    plugin_data.update(legacy_payload)
+    plugin_dict["kind"] = plugin_dict.get("kind") or plugin_name
+    plugin_dict["version"] = plugin_dict.get("version") or "1.0"
+    plugin_dict["data"] = plugin_data
+    plugins_payload[plugin_name] = plugin_dict
+    data["plugins"] = plugins_payload
+    return data
+
+
+def compat_refs_from_payload(
+    payload: dict[str, Any], *, plugin_name: str = _RAG_COMPAT_KIND
+) -> list[Any]:
+    compat = rag_compat_data_from_payload(payload, plugin_name=plugin_name)
+    refs = _coerce_reference_list(compat.get("refs"))
+    if refs:
+        return refs
+
+    retrievals = compat.get("retrievals")
+    if not isinstance(retrievals, dict):
+        return []
+
+    from app.domain.models import Reference
+
+    tool_calls = payload.get("toolCalls") or payload.get("tool_calls") or []
+    step_by_tool_call_id: dict[str, int | None] = {}
+    if isinstance(tool_calls, list):
+        for tool_call in tool_calls:
+            if hasattr(tool_call, "id"):
+                tool_call_id = getattr(tool_call, "id", "")
+                step_number = getattr(tool_call, "step_number", None)
+            elif isinstance(tool_call, dict):
+                tool_call_id = str(tool_call.get("id") or "")
+                step_number = tool_call.get("stepNumber", tool_call.get("step_number"))
+            else:
+                continue
+            if tool_call_id:
+                step_by_tool_call_id[tool_call_id] = (
+                    step_number if isinstance(step_number, int) else None
+                )
+
+    flattened: list[Reference] = []
+    for tool_call_id, bucket in retrievals.items():
+        if not isinstance(bucket, dict):
+            continue
+        candidates = bucket.get("candidates")
+        if not isinstance(candidates, list):
+            continue
+        for candidate in candidates:
+            if not isinstance(candidate, dict):
+                continue
+            candidate_tool_call_id = candidate.get("toolCallId") or (
+                tool_call_id if tool_call_id != RagCompatPack._UNASSOCIATED_KEY else None
+            )
+            flattened.append(
+                Reference(
+                    url=str(candidate.get("url") or ""),
+                    title=candidate.get("title"),
+                    content=candidate.get("chunk"),
+                    messageIndex=step_by_tool_call_id.get(str(candidate_tool_call_id))
+                    if candidate_tool_call_id
+                    else None,
+                )
+            )
+    return flattened
+
+
+def compat_total_references_from_payload(
+    payload: dict[str, Any], *, plugin_name: str = _RAG_COMPAT_KIND
+) -> int:
+    compat = rag_compat_data_from_payload(payload, plugin_name=plugin_name)
+    explicit_total = compat.get("totalReferences")
+    if isinstance(explicit_total, int):
+        return explicit_total
+
+    history_count = 0
+    history_annotations = compat.get("historyAnnotations")
+    if isinstance(history_annotations, list):
+        for annotation in history_annotations:
+            if isinstance(annotation, dict) and isinstance(annotation.get("refs"), list):
+                history_count += len(annotation["refs"])
+    if history_count:
+        return history_count
+    return len(compat_refs_from_payload(payload, plugin_name=plugin_name))
+
+
+def apply_export_projection(
+    doc: dict[str, Any], *, plugin_name: str = _RAG_COMPAT_KIND
+) -> dict[str, Any]:
+    projected = dict(doc)
+    compat = rag_compat_data_from_payload(projected, plugin_name=plugin_name)
+    if not compat:
+        return projected
+
+    refs = compat_refs_from_payload(projected, plugin_name=plugin_name)
+    projected["refs"] = [ref.model_dump(by_alias=True, exclude_none=True) for ref in refs]
+    projected["totalReferences"] = len(refs)
+
+    if projected.get("synthQuestion") is None:
+        projected["synthQuestion"] = compat.get("synthQuestion") or _history_message(
+            projected.get("history"), "user"
+        )
+    if projected.get("editedQuestion") is None:
+        projected["editedQuestion"] = compat.get("editedQuestion") or projected.get("synthQuestion")
+    if projected.get("answer") is None:
+        projected["answer"] = compat.get("answer") or _history_message(
+            projected.get("history"), "assistant", reverse=True
+        )
+
+    return projected
+
+
+class RagCompatPack(PluginPack):
+    """RAG compatibility pack.
+
+    Owns retrieval-specific behavior behind the generic plugin-pack contract.
+    Registered at startup via PluginPackRegistry so misconfiguration raises
+    a clear startup error instead of silently producing wrong data.
+
+    Design notes:
+    - The ``rag-compat`` plugin payload is written by
+      AgenticGroundTruthEntry.translate_legacy_payload_for_core_model during
+      ingest of legacy RAG-shaped documents.
+    - Core approval checks (history, tool-call consistency) run before pack
+      hooks. The pack adds RAG-specific approval gates that cannot be expressed
+      generically.
+    - The pack does NOT add new top-level fields to the host model; all RAG
+      state is accessed via plugins["rag-compat"].data.
+    - Reference attachment and detachment are owned by this pack; the generic
+      SearchService only owns the query path.
+    """
+
+    @property
+    def name(self) -> str:
+        return _RAG_COMPAT_KIND
+
+    def validate_registration(self) -> None:
+        """Validate that the rag-compat kind constant matches the host model.
+
+        Fails startup if someone renames the plugin key in
+        AgenticGroundTruthEntry without updating this pack (or vice-versa).
+        """
+        from app.domain.models import AgenticGroundTruthEntry
+
+        expected = AgenticGroundTruthEntry._RAG_COMPAT_PLUGIN
+        if expected != _RAG_COMPAT_KIND:
+            raise ValueError(
+                f"RagCompatPack kind '{_RAG_COMPAT_KIND}' does not match "
+                f"AgenticGroundTruthEntry._RAG_COMPAT_PLUGIN '{expected}'. "
+                "Update _RAG_COMPAT_KIND in rag_compat.py to keep them in sync."
+            )
+        logger.debug("rag_compat_pack.validate_registration.ok | kind=%s", _RAG_COMPAT_KIND)
+
+    def collect_approval_errors(self, item: AgenticGroundTruthEntry) -> list[str]:
+        """Return RAG-specific approval errors for an item.
+
+        Items that have no RAG compat data receive no additional errors.
+        """
+        compat = self.rag_compat_data(item)
+        if not compat:
+            return []
+        # RAG items: future validation hooks go here.
+        # e.g. per-retrieval-call selection completeness could be enforced once
+        # FR-029/FR-030 retrieval tool-call per-call state is implemented.
+        return []
+
+    def collect_approval_waivers(
+        self, item: AgenticGroundTruthEntry, core_errors: list[str]
+    ) -> list[str]:
+        """Waive core errors that do not apply to RAG retrieval-only items.
+
+        When ``reference_count(item) > 0`` (indicating a retrieval-based
+        item), the following core checks are waived:
+        - "history must include at least one assistant message" — retrieval-only
+          items may not produce an assistant reply.
+        - "expectedTools.required must include at least one tool…" — retrieval
+          items may use reference attachment instead of classified tool calls.
+        """
+        if self.reference_count(item) == 0:
+            return []
+
+        waivers: list[str] = []
+        assistant_error = "history must include at least one assistant message"
+        if assistant_error in core_errors:
+            waivers.append(assistant_error)
+
+        required_tools_error = (
+            "expectedTools.required must include at least one tool "
+            "before approval when toolCalls are present"
+        )
+        if required_tools_error in core_errors:
+            waivers.append(required_tools_error)
+
+        return waivers
+
+    # ------------------------------------------------------------------
+    # Accessor helpers — owned by this pack so callers don't embed the
+    # plugin-kind string literal elsewhere.
+    # ------------------------------------------------------------------
+
+    def rag_compat_data(self, item: AgenticGroundTruthEntry) -> dict[str, Any]:
+        """Return the raw rag-compat plugin data dict for an item, or {}."""
+        return item.get_plugin_data(_RAG_COMPAT_KIND) or {}
+
+    def refs_from_item(self, item: AgenticGroundTruthEntry) -> list[Any]:
+        """Return the references list projected from the rag-compat payload."""
+        return compat_refs_from_payload(
+            {
+                "plugins": item.plugins,
+                "toolCalls": item.tool_calls,
+                "history": item.history,
+            }
+        )
+
+    def reference_count(self, item: AgenticGroundTruthEntry) -> int:
+        refs = self.refs_from_item(item)
+        if refs:
+            return len(refs)
+
+        compat = self.rag_compat_data(item)
+        explicit_total = compat.get("totalReferences")
+        return explicit_total if isinstance(explicit_total, int) and explicit_total > 0 else 0
+
+    def replace_references(
+        self, item: AgenticGroundTruthEntry, refs: list[Reference]
+    ) -> AgenticGroundTruthEntry:
+        serialized = [ref.model_dump(by_alias=True, exclude_none=True) for ref in refs]
+        item._set_rag_compat_value("refs", serialized)
+        item._set_rag_compat_value("retrievals", None)
+        # Clear cached totalReferences so it will be recomputed from refs/historyAnnotations
+        if "totalReferences" in item.__dict__:
+            del item.__dict__["totalReferences"]
+        item._set_rag_compat_value("totalReferences", None)  # Remove from plugin storage too
+        return item
+
+    def attach_reference(
+        self, item: AgenticGroundTruthEntry, ref: Reference
+    ) -> AgenticGroundTruthEntry:
+        """Attach a reference to an item via the rag-compat plugin payload.
+
+        This is a RAG-compat concern; the generic core does not manage refs.
+        References are stored in ``plugins["rag-compat"].data`` by way of
+        ``replace_references``, which also clears any cached total.
+
+        Args:
+            item: The ground-truth item to modify in-place.
+            ref: The reference to attach.
+
+        Returns:
+            The same item (mutated in-place) for convenience.
+        """
+        current = list(self.refs_from_item(item))
+        current.append(ref)
+        return self.replace_references(item, current)
+
+    def detach_reference(
+        self, item: AgenticGroundTruthEntry, ref_url: str
+    ) -> AgenticGroundTruthEntry:
+        """Detach a reference from an item by URL, using the rag-compat payload.
+
+        This is a RAG-compat concern; the generic core does not manage refs.
+
+        Args:
+            item: The ground-truth item to modify in-place.
+            ref_url: The URL of the reference to remove.
+
+        Returns:
+            The same item (mutated in-place) for convenience.
+        """
+        remaining = [r for r in self.refs_from_item(item) if getattr(r, "url", None) != ref_url]
+        return self.replace_references(item, remaining)
+
+    # ------------------------------------------------------------------
+    # Per-tool-call retrieval state (Phase 6 — retrieval normalization)
+    #
+    # New items store references per retrieval tool call inside
+    # ``plugins["rag-compat"].data.retrievals``.
+    # Read path: per-call state first, then fall back to top-level refs.
+    # Write path: always to per-call state.
+    # ------------------------------------------------------------------
+
+    _UNASSOCIATED_KEY: str = "_unassociated"
+
+    def get_retrievals(self, item: AgenticGroundTruthEntry) -> dict[str, Any]:
+        """Return the full retrievals dict or {} when absent."""
+        compat = self.rag_compat_data(item)
+        retrievals = compat.get("retrievals")
+        return dict(retrievals) if isinstance(retrievals, dict) else {}
+
+    def get_retrieval_candidates(
+        self, item: AgenticGroundTruthEntry, tool_call_id: str
+    ) -> list[dict[str, Any]]:
+        """Return candidate list for one tool call, or []."""
+        retrievals = self.get_retrievals(item)
+        bucket = retrievals.get(tool_call_id)
+        if isinstance(bucket, dict):
+            cands = bucket.get("candidates")
+            return list(cands) if isinstance(cands, list) else []
+        return []
+
+    def set_retrieval_candidates(
+        self,
+        item: AgenticGroundTruthEntry,
+        tool_call_id: str,
+        candidates: list[dict[str, Any]],
+    ) -> None:
+        """Set candidates for a single tool call (write-through to plugin data)."""
+        compat = self.rag_compat_data(item)
+        retrievals = dict(compat.get("retrievals") or {})
+        retrievals[tool_call_id] = {"candidates": candidates}
+        item._set_rag_compat_value("retrievals", retrievals)
+
+    def set_retrievals(
+        self,
+        item: AgenticGroundTruthEntry,
+        retrievals: dict[str, Any],
+    ) -> None:
+        """Replace the entire retrievals dict."""
+        item._set_rag_compat_value("retrievals", retrievals)
+
+    def has_per_call_state(self, item: AgenticGroundTruthEntry) -> bool:
+        """Return True when per-call retrieval state exists."""
+        compat = self.rag_compat_data(item)
+        retrievals = compat.get("retrievals")
+        return isinstance(retrievals, dict) and len(retrievals) > 0
+
+    def get_all_candidates_flat(self, item: AgenticGroundTruthEntry) -> list[dict[str, Any]]:
+        """Flatten all per-call candidates into a single list.
+
+        Read path: returns per-call candidates when present.  Falls back
+        to converting top-level refs into candidate dicts for backward compat.
+        """
+        if self.has_per_call_state(item):
+            result: list[dict[str, Any]] = []
+            for tool_call_id, bucket in self.get_retrievals(item).items():
+                if not isinstance(bucket, dict):
+                    continue
+                cands = bucket.get("candidates")
+                if isinstance(cands, list):
+                    for c in cands:
+                        entry = dict(c) if isinstance(c, dict) else {}
+                        if "toolCallId" not in entry:
+                            entry["toolCallId"] = tool_call_id
+                        result.append(entry)
+            return result
+
+        # Backward compat: convert top-level refs to candidate shape
+        refs = item.refs
+        return [
+            {
+                "url": getattr(r, "url", ""),
+                "title": getattr(r, "title", None),
+                "chunk": getattr(r, "content", None),
+                "relevance": None,
+                "toolCallId": None,
+            }
+            for r in refs
+        ]
+
+    def get_explorer_fields(self) -> list[ExplorerFieldDefinition]:
+        return [
+            ExplorerFieldDefinition(
+                key="rag-compat:totalReferences",
+                label="References",
+                field_type="number",
+                sortable=True,
+                filterable=True,
+            ),
+            ExplorerFieldDefinition(
+                key="rag-compat:perCallRetrievals",
+                label="Per-Call Retrievals",
+                field_type="boolean",
+                filterable=True,
+            ),
+        ]
+
+    def get_export_transforms(self) -> list[ExportTransform]:
+        return [
+            ExportTransform(
+                name="rag-compat:project-legacy-export-fields",
+                description="Project rag-compat retrieval/reference fields into export payloads",
+                transform=apply_export_projection,
+            )
+        ]
+
+    def migrate_refs_to_per_call(self, item: AgenticGroundTruthEntry) -> bool:
+        """Migrate top-level refs into per-call state (idempotent).
+
+        Associates refs with retrieval tool calls by matching each ref's
+        ``messageIndex`` to a tool call's ``step_number`` when possible.
+        Refs that cannot be matched go into the ``_unassociated`` bucket.
+
+        Returns True if migration produced changes.
+        """
+        if self.has_per_call_state(item):
+            return False
+
+        refs = item.refs
+        if not refs:
+            return False
+
+        # Build a map from step/messageIndex to tool call id
+        tool_calls = item.tool_calls or []
+        step_to_tc: dict[int, str] = {}
+        for tc in tool_calls:
+            if tc.step_number is not None:
+                step_to_tc[tc.step_number] = tc.id
+
+        retrievals: dict[str, dict[str, list[dict[str, Any]]]] = {}
+        for ref in refs:
+            mi = getattr(ref, "messageIndex", None)
+            tc_id = step_to_tc.get(mi) if mi is not None else None
+            key = tc_id or self._UNASSOCIATED_KEY
+
+            if key not in retrievals:
+                retrievals[key] = {"candidates": []}
+            retrievals[key]["candidates"].append(
+                {
+                    "url": getattr(ref, "url", ""),
+                    "title": getattr(ref, "title", None),
+                    "chunk": getattr(ref, "content", None),
+                    "relevance": None,
+                    "rawPayload": None,
+                    "toolCallId": key if key != self._UNASSOCIATED_KEY else None,
+                }
+            )
+
+        self.set_retrievals(item, retrievals)
+        return True
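Reviewer note: the bucketing done by `migrate_refs_to_per_call` can be sketched in isolation as below. This is a minimal illustration, not the pack's code: `ToolCall`, `Ref`, and `bucket_refs` are hypothetical stand-ins for the real models and method, and the candidate dict is trimmed to two keys.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any

UNASSOCIATED_KEY = "_unassociated"  # mirrors _UNASSOCIATED_KEY in the pack


@dataclass
class ToolCall:  # hypothetical stand-in for the real tool-call model
    id: str
    step_number: int | None


@dataclass
class Ref:  # hypothetical stand-in for Reference
    url: str
    messageIndex: int | None = None


def bucket_refs(
    refs: list[Ref], tool_calls: list[ToolCall]
) -> dict[str, dict[str, list[dict[str, Any]]]]:
    """Group refs under the tool call whose step_number equals their messageIndex."""
    step_to_tc = {tc.step_number: tc.id for tc in tool_calls if tc.step_number is not None}
    retrievals: dict[str, dict[str, list[dict[str, Any]]]] = {}
    for ref in refs:
        tc_id = step_to_tc.get(ref.messageIndex) if ref.messageIndex is not None else None
        # Refs without a matching tool call land in the shared fallback bucket.
        key = tc_id or UNASSOCIATED_KEY
        retrievals.setdefault(key, {"candidates": []})["candidates"].append(
            {"url": ref.url, "toolCallId": tc_id}
        )
    return retrievals


example = bucket_refs(
    [Ref("https://example.com/a", 1), Ref("https://example.com/b")],
    [ToolCall("tc-1", 1)],
)
```

Here the first ref is filed under `tc-1` and the second, which carries no `messageIndex`, under `_unassociated` with a null `toolCallId`.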
diff --git a/backend/app/services/assignment_service.py b/backend/app/services/assignment_service.py
index 400598b..28d2022 100644
--- a/backend/app/services/assignment_service.py
+++ b/backend/app/services/assignment_service.py
@@ -2,7 +2,7 @@
 
 import re
 from app.adapters.repos.base import GroundTruthRepo
-from app.domain.models import GroundTruthItem, AssignmentDocument, Reference
+from app.domain.models import AgenticGroundTruthEntry, AssignmentDocument, HistoryItem
 from app.plugins import get_default_registry
 from app.core.errors import AssignmentConflictError
 from app.core.config import get_sampling_allocation
@@ -98,7 +98,7 @@ def compute_quotas(weights: dict[str, float], k: int) -> dict[str, int]:
 
     def can_assign_item(
         self,
-        item: GroundTruthItem,
+        item: AgenticGroundTruthEntry,
         user_id: str,
         force: bool = False,
         user_roles: list[str] | None = None,
@@ -134,12 +134,12 @@ def can_assign_item(
 
         return True, None
 
-    async def get_assigned(self, user_id: str) -> list[GroundTruthItem]:
+    async def get_assigned(self, user_id: str) -> list[AgenticGroundTruthEntry]:
         return await self.repo.list_assigned(user_id)
 
     async def sample_candidates(
         self, user_id: str, limit: int, exclude_ids: list[str] | None = None
-    ) -> list[GroundTruthItem]:
+    ) -> list[AgenticGroundTruthEntry]:
         """Sample unassigned items with weighted allocation across datasets.
 
         This method implements the business logic for sampling candidates:
@@ -171,7 +171,7 @@ async def sample_candidates(
                 "exclude_count": len(exclude_ids) if exclude_ids else 0,
             },
         )
-        results: list[GroundTruthItem] = await self.repo.list_assigned(user_id)
+        results: list[AgenticGroundTruthEntry] = await self.repo.list_assigned(user_id)
         seen_ids: set[str] = {it.id for it in results}
         # Add caller-provided excludes
         if exclude_ids:
@@ -230,7 +230,7 @@ async def sample_candidates(
         )
 
         # 4) Query each dataset up to its quota (single pass)
-        per_dataset_results: dict[str, list[GroundTruthItem]] = {}
+        per_dataset_results: dict[str, list[AgenticGroundTruthEntry]] = {}
         for ds, q in quotas.items():
             if q <= 0:
                 logger.debug(
@@ -313,7 +313,7 @@ async def sample_candidates(
         )
         return final
 
-    async def self_assign(self, user_id: str, limit: int) -> list[GroundTruthItem]:
+    async def self_assign(self, user_id: str, limit: int) -> list[AgenticGroundTruthEntry]:
         if limit <= 0:
             logger.debug(
                 "self_assign.skip_non_positive_limit",
@@ -324,7 +324,7 @@ async def self_assign(self, user_id: str, limit: int) -> list[GroundTruthItem]:
         assigned_docs: list[AssignmentDocument] = []
         seen_ids: set[str] = set()
 
-        async def _try_assign(candidates: list[GroundTruthItem], remaining: int) -> None:
+        async def _try_assign(candidates: list[AgenticGroundTruthEntry], remaining: int) -> None:
             nonlocal assigned_docs, seen_ids
             if remaining <= 0:
                 return
@@ -416,7 +416,7 @@ async def _try_assign(candidates: list[GroundTruthItem], remaining: int) -> None
             await _try_assign(retry, remaining)
 
         # Return the underlying ground truth items in the order they were added
-        ground_truth_items: list[GroundTruthItem] = []
+        ground_truth_items: list[AgenticGroundTruthEntry] = []
         for ad in assigned_docs[:limit]:
             gt = await self.repo.get_gt(ad.datasetName, ad.bucket, ad.ground_truth_id)
             if gt:
@@ -460,7 +460,7 @@ async def assign_single_item(
         user_id: str,
         force: bool = False,
         user_roles: list[str] | None = None,
-    ) -> GroundTruthItem:
+    ) -> AgenticGroundTruthEntry:
         """Assign a single ground truth item to a user.
 
         This method:
@@ -598,7 +598,9 @@ async def assign_single_item(
 
         return updated
 
-    async def duplicate_item(self, original: GroundTruthItem, user_id: str) -> GroundTruthItem:
+    async def duplicate_item(
+        self, original: AgenticGroundTruthEntry, user_id: str
+    ) -> AgenticGroundTruthEntry:
         """Create a copy of a GroundTruth item for rephrasing.
 
         Rules:
@@ -617,30 +619,21 @@ async def duplicate_item(self, original: GroundTruthItem, user_id: str) -> Groun
             new_tags.append(rephrase_tag)
 
         now = datetime.now(timezone.utc)
-        new_item = GroundTruthItem(
-            id=randomname.get_name(),
-            datasetName=original.datasetName,
-            bucket=original.bucket,
-            status=GroundTruthStatus.draft,
-            synthQuestion=original.synth_question,
-            edited_question=original.edited_question,
-            answer=original.answer,
-            refs=[Reference.model_validate(r) for r in (original.refs or [])],
-            manualTags=new_tags,
-            comment=original.comment,
-            history=original.history,
-            contextUsedForGeneration=original.contextUsedForGeneration,
-            contextSource=original.contextSource,
-            modelUsedForGeneration=original.modelUsedForGeneration,
-            semanticClusterNumber=original.semanticClusterNumber,
-            weight=original.weight,
-            samplingBucket=original.samplingBucket,
-            questionLength=original.questionLength,
-            assignedTo=user_id,
-            assigned_at=now,
-            updatedBy=None,
-            reviewed_at=None,
-        )
+        new_item = AgenticGroundTruthEntry.model_validate(original.model_dump(by_alias=True))
+        new_item.history = [
+            entry
+            if isinstance(entry, HistoryItem)
+            else HistoryItem.model_validate(entry.model_dump(by_alias=True))
+            for entry in (new_item.history or [])
+        ]
+        new_item.id = randomname.get_name()
+        new_item.status = GroundTruthStatus.draft
+        new_item.manual_tags = new_tags
+        new_item.assignedTo = user_id
+        new_item.assigned_at = now
+        new_item.updatedBy = None
+        new_item.reviewed_at = None
+        new_item.etag = None
 
         # Apply computed tags based on the new item's properties
         registry = get_default_registry()
diff --git a/backend/app/services/chat_service.py b/backend/app/services/chat_service.py
deleted file mode 100644
index f0e4217..0000000
--- a/backend/app/services/chat_service.py
+++ /dev/null
@@ -1,153 +0,0 @@
-from __future__ import annotations
-
-import asyncio
-import logging
-from concurrent.futures import ThreadPoolExecutor
-from functools import partial
-from typing import Any
-
-from app.adapters.gtc_inference_adapter import GTCInferenceAdapter
-from app.adapters.agent_steps_store import AgentStepsStore
-
-logger = logging.getLogger(__name__)
-
-# Dedicated thread pool executor for blocking AI operations
-# Sized to handle concurrent requests without exhausting resources
-_executor: ThreadPoolExecutor | None = None
-_EXECUTOR_MAX_WORKERS = 10  # Configurable based on expected concurrency
-
-
-def _get_executor() -> ThreadPoolExecutor:
-    """Get or create dedicated ThreadPoolExecutor for blocking operations."""
-    global _executor
-    if _executor is None:
-        _executor = ThreadPoolExecutor(
-            max_workers=_EXECUTOR_MAX_WORKERS,
-            thread_name_prefix="gtc_inference",
-        )
-        logger.info("Created ThreadPoolExecutor with %d workers", _EXECUTOR_MAX_WORKERS)
-    return _executor
-
-
-class ChatService:
-    """Facade for generating chat responses via the inference service.
-
-    Uses threadpool to run blocking sync inference calls without blocking FastAPI.
-    """
-
-    def __init__(
-        self,
-        inference_service: GTCInferenceAdapter | None,
-        *,
-        steps_store: AgentStepsStore | None = None,
-        store_steps: bool = False,
-    ) -> None:
-        self._inference_service = inference_service
-        self._steps_store = steps_store
-        self._store_steps = store_steps and steps_store is not None
-
-    @property
-    def is_configured(self) -> bool:
-        """Return True if inference service is configured and ready."""
-        return self._inference_service is not None
-
-    def set_steps_store(self, store: AgentStepsStore | None) -> None:
-        self._steps_store = store
-        if store is None:
-            self._store_steps = False
-
-    def set_store_steps(self, enabled: bool) -> None:
-        self._store_steps = enabled and self._steps_store is not None
-
-    async def generate_response(
-        self,
-        *,
-        user_id: str,
-        message: str,
-        context: str | None,
-    ) -> dict[str, Any]:
-        """Generate a chat response using the inference service.
-
-        Runs the blocking inference call in a threadpool to keep FastAPI responsive.
-
-        Args:
-            user_id: User identifier for logging
-            message: User's question/message
-            context: Optional context (currently unused)
-
-        Returns:
-            Dict with 'content' (assistant reply) and 'references' (list of citations)
-
-        Raises:
-            ValueError: If message is empty
-            RuntimeError: If inference service not configured or request fails
-        """
-        if not message.strip():
-            raise ValueError("message cannot be empty")
-
-        logger.info(
-            "Generating response for user=%s, using_agent=%s",
-            user_id,
-            self.is_configured,
-        )
-
-        if self._inference_service is not None:
-            # Run blocking inference in threadpool
-            loop = asyncio.get_event_loop()
-            executor = _get_executor()
-
-            try:
-                response = await loop.run_in_executor(
-                    executor,
-                    partial(
-                        self._inference_service.generate,
-                        user_id=user_id,
-                        message=message,
-                    ),
-                )
-            except RuntimeError:
-                # Re-raise RuntimeError (includes "retrieval not configured" errors)
-                raise
-
-            logger.info(
-                "Agent returned response with %d references",
-                len(response.get("references", [])),
-            )
-        else:
-            logger.warning("Agent not configured, returning mock response")
-            response = self._mock_response(message)
-
-        await self._store_interaction(user_id, message, context, response)
-        return response
-
-    async def _store_interaction(
-        self,
-        user_id: str,
-        message: str,
-        context: str | None,
-        response: dict[str, Any],
-    ) -> None:
-        if not self._store_steps or self._steps_store is None:
-            return
-        payload = {
-            "message": message,
-            "context": context,
-        }
-        try:
-            await self._steps_store.save(
-                user_id=user_id,
-                request=payload,
-                response=response,
-            )
-        except Exception:
-            # Persisting steps is best-effort; never block the main response
-            pass
-
-    def _mock_response(self, message: str) -> dict[str, Any]:
-        return {
-            "content": (
-                "No agent configuration detected. Returning mock response for testing purposes. "
-                f"User message: {message}"
-            ),
-            "references": [],
-        }
diff --git a/backend/app/services/duplicate_detection_service.py b/backend/app/services/duplicate_detection_service.py
index d1660bb..439533a 100644
--- a/backend/app/services/duplicate_detection_service.py
+++ b/backend/app/services/duplicate_detection_service.py
@@ -12,12 +12,13 @@
 
 from __future__ import annotations
 
+import json
 import re
 from typing import Sequence
 
 from pydantic import BaseModel, Field, ConfigDict
 
-from app.domain.models import GroundTruthItem
+from app.domain.models import AgenticGroundTruthEntry
 from app.domain.enums import GroundTruthStatus
 
 
@@ -51,12 +52,75 @@ def _normalize_text(text: str | None) -> str:
     return normalized
 
 
-def _get_question_text(item: GroundTruthItem) -> str:
+def _get_question_text(item: AgenticGroundTruthEntry) -> str:
     """Get the effective question text (edited or synth)."""
     return item.edited_question or item.synth_question or ""
 
 
-def _items_are_duplicates(draft: GroundTruthItem, approved: GroundTruthItem) -> tuple[bool, str]:
+def _serialize_generic_value(value: object) -> str:
+    if value is None:
+        return ""
+    if isinstance(value, BaseModel):
+        value = value.model_dump(by_alias=True, exclude_none=True)
+    if isinstance(value, str):
+        return value
+    try:
+        return json.dumps(value, sort_keys=True, ensure_ascii=False, default=str)
+    except TypeError:
+        return str(value)
+
+
+def _prune_empty(value: object) -> object | None:
+    if value is None:
+        return None
+    if isinstance(value, BaseModel):
+        value = value.model_dump(by_alias=True, exclude_none=True)
+    if isinstance(value, str):
+        return value or None
+    if isinstance(value, dict):
+        pruned = {
+            key: pruned_value
+            for key, nested in value.items()
+            if (pruned_value := _prune_empty(nested)) is not None
+        }
+        return pruned or None
+    if isinstance(value, list):
+        pruned = [
+            pruned_value for nested in value if (pruned_value := _prune_empty(nested)) is not None
+        ]
+        return pruned or None
+    return value
+
+
+def _history_signature(item: AgenticGroundTruthEntry) -> str:
+    history = item.history or []
+    return _normalize_text(
+        "\n".join(f"{entry.role}:{entry.msg}" for entry in history if entry.role and entry.msg)
+    )
+
+
+def _generic_signature(item: AgenticGroundTruthEntry) -> str:
+    structured_payload = _prune_empty(
+        {
+            "scenarioId": item.scenario_id,
+            "contextEntries": item.context_entries,
+            "toolCalls": item.tool_calls,
+            "expectedTools": item.expected_tools,
+            "feedback": item.feedback,
+            "metadata": item.metadata,
+            "plugins": item.plugins,
+            "traceIds": item.trace_ids,
+            "tracePayload": item.trace_payload,
+        }
+    )
+    if structured_payload is None:
+        return ""
+    return _normalize_text(_serialize_generic_value(structured_payload))
+
+
+def _items_are_duplicates(
+    draft: AgenticGroundTruthEntry, approved: AgenticGroundTruthEntry
+) -> tuple[bool, str]:
     """Check if two items are likely duplicates.
 
     Returns:
@@ -65,12 +129,8 @@ def _items_are_duplicates(draft: GroundTruthItem, approved: GroundTruthItem) ->
     draft_question = _normalize_text(_get_question_text(draft))
     approved_question = _normalize_text(_get_question_text(approved))
 
-    # Must have a question to compare
-    if not draft_question or not approved_question:
-        return (False, "")
-
-    # Check for exact question match
-    if draft_question == approved_question:
+    # Check for exact question match when both items expose question text
+    if draft_question and approved_question and draft_question == approved_question:
         # Also check answer for stronger signal
         draft_answer = _normalize_text(draft.answer)
         approved_answer = _normalize_text(approved.answer)
@@ -79,11 +139,27 @@ def _items_are_duplicates(draft: GroundTruthItem, approved: GroundTruthItem) ->
             return (True, "exact question and answer match")
         return (True, "exact question match")
 
+    draft_history = _history_signature(draft)
+    approved_history = _history_signature(approved)
+    if draft_history and draft_history == approved_history:
+        draft_generic = _generic_signature(draft)
+        approved_generic = _generic_signature(approved)
+        if draft_generic and draft_generic == approved_generic:
+            return (True, "exact history and generic fields match")
+        return (True, "exact history match")
+
+    draft_generic = _generic_signature(draft)
+    approved_generic = _generic_signature(approved)
+    if draft_generic and draft_generic == approved_generic:
+        return (True, "exact generic fields match")
+
     return (False, "")
 
 
 def detect_duplicates_for_item(
-    draft_item: GroundTruthItem, approved_items: Sequence[GroundTruthItem], max_results: int = 3
+    draft_item: AgenticGroundTruthEntry,
+    approved_items: Sequence[AgenticGroundTruthEntry],
+    max_results: int = 3,
 ) -> list[DuplicateWarning]:
     """Detect duplicate approved items for a single draft item.
 
@@ -126,8 +202,8 @@ def detect_duplicates_for_item(
 
 
 def detect_duplicates_for_bulk_items(
-    draft_items: Sequence[GroundTruthItem],
-    approved_items: Sequence[GroundTruthItem],
+    draft_items: Sequence[AgenticGroundTruthEntry],
+    approved_items: Sequence[AgenticGroundTruthEntry],
     max_results_per_item: int = 3,
 ) -> list[DuplicateWarning]:
     """Detect duplicates for multiple draft items against approved items.
diff --git a/backend/app/services/ground_truth_update_service.py b/backend/app/services/ground_truth_update_service.py
new file mode 100644
index 0000000..8a2ca69
--- /dev/null
+++ b/backend/app/services/ground_truth_update_service.py
@@ -0,0 +1,250 @@
+from __future__ import annotations
+
+from dataclasses import dataclass
+from datetime import datetime, timezone
+from typing import Any, cast, Sequence
+
+from app.domain.enums import GroundTruthStatus
+from app.domain.models import (
+    AgenticGroundTruthEntry,
+    ContextEntry,
+    ExpectedTools,
+    FeedbackEntry,
+    HistoryEntry,
+    HistoryItem,
+    PluginPayload,
+    Reference,
+    ToolCallRecord,
+)
+from app.plugins.pack_registry import get_rag_compat_pack
+from app.services.tagging_service import apply_computed_tags
+from app.services.validation_service import ValidationError, validate_item_for_approval
+
+
+MISSING = object()
+
+
+class ETagRequiredError(Exception):
+    """Raised when an update request omits optimistic-concurrency state."""
+
+
+class ETagMismatchError(Exception):
+    """Raised when the provided ETag no longer matches persisted state."""
+
+
+@dataclass(slots=True)
+class LegacyCompatUpdate:
+    edited_question: str | None | object = MISSING
+    answer: str | None | object = MISSING
+    refs: list[Reference] | object = MISSING
+
+
+@dataclass(slots=True)
+class UpdateMutationResult:
+    should_delete_assignment: bool = False
+
+
+def read_legacy_compat_update(extras: dict[str, Any]) -> LegacyCompatUpdate:
+    update = LegacyCompatUpdate()
+
+    if "editedQuestion" in extras or "edited_question" in extras:
+        update.edited_question = cast(
+            str | None, extras.get("editedQuestion", extras.get("edited_question"))
+        )
+
+    if "answer" in extras:
+        answer_value = extras["answer"]
+        if answer_value is not None and not isinstance(answer_value, str):
+            raise ValidationError("", "answer", "answer must be a string or null")
+        update.answer = cast(str | None, answer_value)
+
+    if "refs" in extras:
+        refs_payload = extras["refs"]
+        if refs_payload is None:
+            update.refs = []
+        elif isinstance(refs_payload, list):
+            update.refs = [
+                ref if isinstance(ref, Reference) else Reference.model_validate(ref)
+                for ref in refs_payload
+            ]
+        else:
+            raise ValidationError("", "refs", "refs must be a list or null")
+
+    return update
+
+
+def _parse_status(value: GroundTruthStatus | str | None) -> GroundTruthStatus:
+    if value is None:
+        raise ValidationError(
+            "", "status", "status cannot be null; omit the field to leave it unchanged"
+        )
+    try:
+        if isinstance(value, GroundTruthStatus):
+            return value
+        return GroundTruthStatus(str(value))
+    except (ValueError, KeyError) as exc:
+        raise ValidationError(
+            "",
+            "status",
+            f"Invalid status value: {value}. Must be one of: draft, approved, skipped, deleted",
+        ) from exc
+
+
+def parse_history_entries(entries: Sequence[Any]) -> list[HistoryItem]:
+    history: list[HistoryItem] = []
+    for entry in entries:
+        message = getattr(entry, "msg", None)
+        extras = getattr(entry, "model_extra", None) or {}
+        if message is None and isinstance(extras.get("content"), str):
+            message = extras["content"]
+        if not message:
+            raise ValidationError("", "history", "history entries must include a non-empty msg")
+
+        refs_data = extras.get("refs")
+        refs_list = None
+        if refs_data is not None:
+            if not isinstance(refs_data, list):
+                raise ValidationError("", "history", "history refs must be a list")
+            refs_list = [
+                ref if isinstance(ref, Reference) else Reference.model_validate(ref)
+                for ref in refs_data
+            ]
+
+        expected_behavior = extras.get("expectedBehavior", extras.get("expected_behavior"))
+        if expected_behavior is not None and not isinstance(expected_behavior, list):
+            raise ValidationError(
+                "",
+                "history",
+                "history expectedBehavior must be a list when provided",
+            )
+
+        history.append(
+            HistoryItem(
+                role=getattr(entry, "role"),
+                msg=message,
+                refs=refs_list,
+                expected_behavior=expected_behavior,
+            )
+        )
+    return history
+
+
+def apply_shared_update(
+    item: AgenticGroundTruthEntry,
+    *,
+    provided_fields: set[str],
+    comment: str | None = None,
+    history_entries: Sequence[Any] | None = None,
+    context_entries: list[ContextEntry] | None = None,
+    tool_calls: list[ToolCallRecord] | None = None,
+    expected_tools: ExpectedTools | None = None,
+    feedback: list[FeedbackEntry] | None = None,
+    metadata: dict[str, Any] | None = None,
+    plugins: dict[str, PluginPayload] | None = None,
+    trace_ids: dict[str, str] | None = None,
+    trace_payload: dict[str, Any] | None = None,
+    scenario_id: str | None = None,
+    manual_tags: list[str] | None = None,
+    status: GroundTruthStatus | str | None = None,
+    approve: bool = False,
+    actor_user_id: str,
+    legacy_update: LegacyCompatUpdate | None = None,
+    clear_assignment_on_statuses: set[GroundTruthStatus] | None = None,
+) -> UpdateMutationResult:
+    now = datetime.now(timezone.utc)
+    deletion_statuses = clear_assignment_on_statuses or set()
+    should_delete_assignment = False
+
+    if "comment" in provided_fields:
+        item.comment = comment or ""
+
+    if "history" in provided_fields:
+        if history_entries is None:
+            item.history = []
+            item.totalReferences = 0
+        else:
+            # HistoryItem is a subclass of HistoryEntry, so this is safe
+            item.history = cast(list[HistoryEntry], parse_history_entries(history_entries))
+            item.totalReferences = 0
+
+    if "context_entries" in provided_fields:
+        item.context_entries = context_entries or []
+    if "tool_calls" in provided_fields:
+        item.tool_calls = tool_calls or []
+    if "expected_tools" in provided_fields:
+        if expected_tools is None:
+            raise ValidationError(
+                "",
+                "expectedTools",
+                "expectedTools cannot be null; omit the field to leave it unchanged",
+            )
+        item.expected_tools = expected_tools
+    if "feedback" in provided_fields:
+        item.feedback = feedback or []
+    if "metadata" in provided_fields:
+        item.metadata = metadata or {}
+    if "plugins" in provided_fields:
+        item.plugins = plugins or {}
+    if "trace_ids" in provided_fields:
+        item.trace_ids = trace_ids
+    if "trace_payload" in provided_fields:
+        item.trace_payload = trace_payload or {}
+    if "scenario_id" in provided_fields:
+        item.scenario_id = scenario_id or ""
+    if "manual_tags" in provided_fields:
+        item.manual_tags = manual_tags or []
+
+    if legacy_update is not None:
+        if legacy_update.edited_question is not MISSING:
+            item.edited_question = cast(str | None, legacy_update.edited_question)
+        if legacy_update.answer is not MISSING:
+            item.answer = cast(str | None, legacy_update.answer)
+        if legacy_update.refs is not MISSING:
+            rag_compat_pack = get_rag_compat_pack()
+            rag_compat_pack.replace_references(
+                item, list(cast(list[Reference], legacy_update.refs))
+            )
+
+    if approve:
+        item.status = GroundTruthStatus.approved
+        item.reviewed_at = now
+        item.updatedBy = actor_user_id
+        if GroundTruthStatus.approved in deletion_statuses:
+            item.assignedTo = None
+            item.assigned_at = None
+            should_delete_assignment = True
+
+    if "status" in provided_fields:
+        item.status = _parse_status(status)
+        if item.status in deletion_statuses:
+            item.assignedTo = None
+            item.assigned_at = None
+            should_delete_assignment = True
+        if item.status == GroundTruthStatus.approved:
+            item.reviewed_at = now
+            item.updatedBy = actor_user_id
+
+    return UpdateMutationResult(should_delete_assignment=should_delete_assignment)
+
+
+async def persist_shared_update(
+    repo: Any,
+    item: AgenticGroundTruthEntry,
+    *,
+    if_match: str | None,
+    payload_etag: str | None,
+) -> None:
+    if not if_match and not payload_etag:
+        raise ETagRequiredError()
+    item.etag = if_match or payload_etag
+
+    if item.status == GroundTruthStatus.approved:
+        validate_item_for_approval(item)
+    apply_computed_tags(item)
+
+    try:
+        await repo.upsert_gt(item)
+    except ValueError as exc:
+        if str(exc) == "etag_mismatch":
+            raise ETagMismatchError() from exc
+        raise
diff --git a/backend/app/services/pii_service.py b/backend/app/services/pii_service.py
index eed7a66..7a0555c 100644
--- a/backend/app/services/pii_service.py
+++ b/backend/app/services/pii_service.py
@@ -11,11 +11,11 @@
 
 import re
 from dataclasses import dataclass
-from typing import Sequence
+from typing import Any, Sequence
 
 from pydantic import BaseModel, Field
 
-from app.domain.models import GroundTruthItem
+from app.domain.models import AgenticGroundTruthEntry
 
 
 class PIIWarning(BaseModel):
@@ -159,7 +159,7 @@ def scan_text_for_pii(text: str, field_name: str, item_id: str) -> list[PIIWarni
     return warnings
 
 
-def scan_item_for_pii(item: GroundTruthItem) -> list[PIIWarning]:
+def scan_item_for_pii(item: AgenticGroundTruthEntry) -> list[PIIWarning]:
     """Scan a ground truth item for PII in all relevant fields.
 
     Phase 1 scans:
@@ -178,6 +178,23 @@ def scan_item_for_pii(item: GroundTruthItem) -> list[PIIWarning]:
     item_id = item.id or "(no ID)"
     warnings: list[PIIWarning] = []
 
+    def scan_nested_value(value: Any, field_name: str) -> None:
+        if value is None:
+            return
+        if isinstance(value, BaseModel):
+            scan_nested_value(value.model_dump(by_alias=True, exclude_none=True), field_name)
+            return
+        if isinstance(value, str):
+            warnings.extend(scan_text_for_pii(value, field_name, item_id))
+            return
+        if isinstance(value, dict):
+            for key, nested in value.items():
+                scan_nested_value(nested, f"{field_name}.{key}")
+            return
+        if isinstance(value, Sequence) and not isinstance(value, (str, bytes, bytearray)):
+            for idx, nested in enumerate(value):
+                scan_nested_value(nested, f"{field_name}[{idx}]")
+
     # Scan primary text fields
     if item.synth_question:
         warnings.extend(scan_text_for_pii(item.synth_question, "synthQuestion", item_id))
@@ -197,10 +214,22 @@ def scan_item_for_pii(item: GroundTruthItem) -> list[PIIWarning]:
             if turn.msg:
                 warnings.extend(scan_text_for_pii(turn.msg, f"history[{idx}].msg", item_id))
 
+    if item.scenario_id:
+        warnings.extend(scan_text_for_pii(item.scenario_id, "scenarioId", item_id))
+
+    scan_nested_value(item.context_entries, "contextEntries")
+    scan_nested_value(item.tool_calls, "toolCalls")
+    scan_nested_value(item.expected_tools, "expectedTools")
+    scan_nested_value(item.feedback, "feedback")
+    scan_nested_value(item.metadata, "metadata")
+    scan_nested_value(item.plugins, "plugins")
+    scan_nested_value(item.trace_ids, "traceIds")
+    scan_nested_value(item.trace_payload, "tracePayload")
+
     return warnings
 
 
-def scan_bulk_items_for_pii(items: Sequence[GroundTruthItem]) -> list[PIIWarning]:
+def scan_bulk_items_for_pii(items: Sequence[AgenticGroundTruthEntry]) -> list[PIIWarning]:
     """Scan multiple ground truth items for PII.
 
     This is the main entry point for bulk import PII detection.
diff --git a/backend/app/services/search_service.py b/backend/app/services/search_service.py
index 494e3b1..368fd5c 100644
--- a/backend/app/services/search_service.py
+++ b/backend/app/services/search_service.py
@@ -1,11 +1,10 @@
 from __future__ import annotations
 
-from typing import Optional, TypedDict, cast
+from typing import Any, Optional, TypedDict, cast
 import logging
 
 from app.adapters.search.base import SearchAdapter
 from app.core.config import settings
-from app.domain.models import GroundTruthItem, Reference
 
 logger = logging.getLogger(__name__)
 
@@ -14,9 +13,17 @@ class SearchResult(TypedDict):
     url: Optional[str]
     title: Optional[str]
     chunk: Optional[str]
+    raw_payload: dict[str, Any]
 
 
 class SearchService:
+    """Generic search façade backed by a pluggable SearchAdapter.
+
+    This service handles the retrieval-query path (``/v1/search``) only.
+    Reference selection and attachment are RAG-compat concerns owned by
+    ``RagCompatPack``; they are not part of the generic core.
+    """
+
     # Canonical fields we care about from search backends
     RESULT_FIELDS: list[str] = ["url", "title", "chunk"]
 
@@ -28,17 +35,11 @@ def __init__(self, adapter: SearchAdapter | None = None) -> None:
         self.title_field = settings.SEARCH_FIELD_TITLE
         self.chunk_field = settings.SEARCH_FIELD_CHUNK
 
-    def attach_reference(self, item: GroundTruthItem, ref: Reference) -> GroundTruthItem:
-        # Attach to canonical 'refs' list.
-        item.refs.append(ref)
-        return item
-
-    def detach_reference(self, item: GroundTruthItem, ref_id: str) -> GroundTruthItem:
-        # Detach by URL if provided; otherwise drop nothing.
-        item.refs = [r for r in item.refs if getattr(r, "url", None) != ref_id]
-        return item
-
     async def query(self, q: str, top: int = 5) -> list[SearchResult]:
+        """Query the configured search backend and return normalized results.
+
+        Returns an empty list when no adapter is configured.
+        """
         if not self.adapter:
             return []
         # Delegate and normalize shape to {url, title}
@@ -50,7 +51,7 @@ async def query(self, q: str, top: int = 5) -> list[SearchResult]:
             url = cast(Optional[str], r.get(self.url_field))
             title = cast(Optional[str], r.get(self.title_field))
             chunk = cast(Optional[str], r.get(self.chunk_field))
-            normalized.append({"url": url, "title": title, "chunk": chunk})
+            normalized.append({"url": url, "title": title, "chunk": chunk, "raw_payload": dict(r)})
 
         logger.debug("search_service.normalized_results", extra={"count": len(normalized)})
         return normalized
diff --git a/backend/app/services/snapshot_service.py b/backend/app/services/snapshot_service.py
index 661e215..5af9e00 100644
--- a/backend/app/services/snapshot_service.py
+++ b/backend/app/services/snapshot_service.py
@@ -8,6 +8,7 @@
 
 from app.adapters.repos.base import GroundTruthRepo
 from app.domain.enums import GroundTruthStatus
+from app.domain.models import AgenticGroundTruthEntry
 from app.exports.models import ExportFilters, SnapshotExportRequest
 from app.exports.pipeline import ExportPipeline
 from app.exports.registry import ExportFormatterRegistry, ExportProcessorRegistry
@@ -21,20 +22,21 @@ def __init__(
         processor_registry: ExportProcessorRegistry,
         formatter_registry: ExportFormatterRegistry,
         default_processor_order: list[str],
+        plugin_export_transforms: list[Any] | None = None,
     ):
         self.repo = repo
         self.export_pipeline = export_pipeline
         self.processor_registry = processor_registry
         self.formatter_registry = formatter_registry
         self.default_processor_order = default_processor_order
+        self.plugin_export_transforms = plugin_export_transforms or []
 
-    async def collect_approved(self) -> list:
-        """Return all approved GroundTruthItems from the repository.
+    async def collect_approved(self) -> list[AgenticGroundTruthEntry]:
+        """Return all approved generic ground truth entries from the repository.
 
         Errors are allowed to surface to callers; no legacy fallbacks.
         """
-        items = await self.repo.list_all_gt(status=GroundTruthStatus.approved)
-        return items
+        return await self.repo.list_all_gt(status=GroundTruthStatus.approved)
 
     async def build_snapshot_payload(self) -> dict:
         """Build an in-memory JSON payload of approved items.
@@ -85,6 +87,9 @@ async def _collect_export_items(
         )
         for processor in processors:
             out_items = processor.process(out_items)
+        out_items = self.processor_registry.apply_transforms(
+            out_items, self.plugin_export_transforms
+        )
 
         filters_payload: dict[str, Any] = {"status": status_value}
         if dataset_names is not None:
diff --git a/backend/app/services/tagging_service.py b/backend/app/services/tagging_service.py
index 6d074cd..41326e0 100644
--- a/backend/app/services/tagging_service.py
+++ b/backend/app/services/tagging_service.py
@@ -7,7 +7,7 @@
 from app.plugins import get_default_registry, TagPluginRegistry
 
 if TYPE_CHECKING:
-    from app.domain.models import GroundTruthItem
+    from app.domain.models import AgenticGroundTruthEntry
 import logging
 
 logger = logging.getLogger(__name__)
@@ -122,7 +122,9 @@ def validate_tags_with_cache(tags: Iterable[str], valid_tags: set[str] | None) -
     return sorted(unique)
 
 
-def apply_computed_tags(item: GroundTruthItem, registry: TagPluginRegistry | None = None) -> None:
+def apply_computed_tags(
+    item: AgenticGroundTruthEntry, registry: TagPluginRegistry | None = None
+) -> None:
     """Compute and set the computed tags for an item.
 
     This mutates the item in place, setting:
@@ -130,7 +132,7 @@ def apply_computed_tags(item: GroundTruthItem, registry: TagPluginRegistry | Non
     - manual_tags: cleaned to remove any tags that are computed tag keys
 
     Args:
-        item: The GroundTruthItem to compute tags for.
+        item: The AgenticGroundTruthEntry to compute tags for.
         registry: Optional pre-fetched registry. If None, fetches default.
     """
     if registry is None:
diff --git a/backend/app/services/validation_service.py b/backend/app/services/validation_service.py
index 0d03d59..4cc05f7 100644
--- a/backend/app/services/validation_service.py
+++ b/backend/app/services/validation_service.py
@@ -1,15 +1,39 @@
-"""Validation service for ground truth items during bulk import."""
+"""Validation service for generic ground truth items during bulk import and approval."""
 
 from __future__ import annotations
 
 import asyncio
-
-from app.container import container
-from app.domain.models import GroundTruthItem, BulkImportError
 import logging
+
+from app.domain.models import AgenticGroundTruthEntry, BulkImportError, HistoryEntry
 from app.services.tagging_service import validate_tags_with_cache
 
 logger = logging.getLogger(__name__)
+container = None
+
+
+def _resolve_plugin_pack_registry(plugin_pack_registry=None):
+    if plugin_pack_registry is not None:
+        return plugin_pack_registry
+
+    global container
+    if container is None:
+        from app.container import container as runtime_container
+
+        container = runtime_container
+    return container.plugin_pack_registry
+
+
+def _resolve_tag_registry_service(tag_registry_service=None):
+    if tag_registry_service is not None:
+        return tag_registry_service
+
+    global container
+    if container is None:
+        from app.container import container as runtime_container
+
+        container = runtime_container
+    return container.tag_registry_service
 
 
 class ValidationError(Exception):
@@ -22,38 +46,118 @@ def __init__(self, item_id: str, field: str, message: str):
         super().__init__(f"Item '{item_id}': {field} - {message}")
 
 
+class ApprovalValidationError(Exception):
+    """Raised when a ground truth item fails generic approval validation."""
+
+    def __init__(self, errors: list[str]):
+        self.errors = errors
+        super().__init__("; ".join(errors))
+
+
+def _normalized_history(item: AgenticGroundTruthEntry) -> list[HistoryEntry]:
+    history = list(item.history or [])
+    question_text = item.edited_question or item.synth_question
+    if history:
+        roles = {entry.role.strip().lower() for entry in history}
+        if "user" not in roles and question_text:
+            history.insert(0, HistoryEntry(role="user", msg=question_text))
+        if "assistant" not in roles and item.answer:
+            history.append(HistoryEntry(role="assistant", msg=item.answer))
+        return history
+
+    synthesized: list[HistoryEntry] = []
+    if question_text:
+        synthesized.append(HistoryEntry(role="user", msg=question_text))
+    if item.answer:
+        synthesized.append(HistoryEntry(role="assistant", msg=item.answer))
+    return synthesized
+
+
+def collect_approval_validation_errors(item: AgenticGroundTruthEntry) -> list[str]:
+    """Return generic approval validation errors for an item.
+
+    The generic core enforces conversation integrity and expected-tool consistency.
+    Legacy RAG-shaped data is tolerated through compatibility helpers so existing
+    assignment/export flows remain intact during the migration.
+    """
+
+    errors: list[str] = []
+    history = _normalized_history(item)
+
+    if not history:
+        errors.append("history must contain at least one conversation message")
+    else:
+        user_messages = [entry for entry in history if entry.role.strip().lower() == "user"]
+        assistant_messages = [
+            entry for entry in history if entry.role.strip().lower() == "assistant"
+        ]
+        if not user_messages:
+            errors.append("history must include at least one user message")
+        if not assistant_messages:
+            errors.append("history must include at least one assistant message")
+
+    tool_call_names = {tool.name for tool in item.tool_calls if tool.name}
+    required_tools = [tool.name for tool in item.expected_tools.required if tool.name]
+
+    if item.tool_calls and not required_tools:
+        errors.append(
+            "expectedTools.required must include at least one tool before approval when toolCalls are present"
+        )
+
+    missing_required_tools = sorted(
+        {name for name in required_tools if name not in tool_call_names}
+    )
+    if missing_required_tools:
+        errors.append(
+            "expectedTools.required references toolCalls that do not exist: "
+            + ", ".join(missing_required_tools)
+        )
+
+    return errors
+
+
+def validate_item_for_approval(item: AgenticGroundTruthEntry, plugin_pack_registry=None) -> None:
+    registry = _resolve_plugin_pack_registry(plugin_pack_registry)
+    errors = collect_approval_validation_errors(item)
+    # Let plugin packs waive specific core errors (e.g. RagCompatPack waives
+    # the assistant-message requirement for retrieval-only items).
+    errors = registry.filter_core_errors(item, errors)
+    # Run plugin-pack approval hooks after the generic core checks.
+    # Each registered pack may contribute additional domain-specific errors
+    # (e.g. RagCompatPack enforcing per-retrieval-call selection completeness).
+    pack_errors = registry.collect_approval_errors(item)
+    errors.extend(pack_errors)
+    if errors:
+        raise ApprovalValidationError(errors)
+
+
 async def validate_ground_truth_item(
-    item: GroundTruthItem, item_index: int, valid_tags_cache: set[str] | None = None
+    item: AgenticGroundTruthEntry,
+    item_index: int,
+    valid_tags_cache: set[str] | None = None,
+    tag_registry_service=None,
 ) -> list[BulkImportError]:
     """Validate a ground truth item for bulk import.
 
     Returns a list of structured validation error objects. Empty list means valid.
-    Used instead of pydantic as tag validation require an async call for the cache
-
-    Validates:
-    - Manual tag values against the tag registry
-    - this can be extended to validate other field if needed
-
-    Args:
-        item: The ground truth item to validate
-        item_index: The 0-based index of the item in the request array
-        valid_tags_cache: Optional pre-fetched set of valid tags to avoid repeated lookups
+    Implemented outside pydantic because tag validation needs an async tag-registry lookup.
     """
+
     errors: list[BulkImportError] = []
     item_id = item.id or "(no ID)"
 
-    # Validate manual tags values (computed tags are system-generated and don't need validation)
+    # Validate manual tag values (computed tags are system-generated and don't need validation)
     if item.manual_tags:
-        # Fetch tags if not cached
+        registry_service = _resolve_tag_registry_service(tag_registry_service)
         if valid_tags_cache is None:
-            valid_tags_cache = set(await container.tag_registry_service.list_tags())
+            valid_tags_cache = set(await registry_service.list_tags())
         try:
-            # Use cached tags
             validate_tags_with_cache(item.manual_tags, valid_tags_cache)
             logger.debug(
-                f"Tag validation passed | item_id: {item_id} | manualTags: {item.manual_tags}"
+                "Tag validation passed | item_id: %s | manualTags: %s",
+                item_id,
+                item.manual_tags,
             )
-
         except ValueError as e:
             errors.append(
                 BulkImportError(
@@ -65,37 +169,52 @@ async def validate_ground_truth_item(
                 )
             )
             logger.warning(
-                f"Tag validation failed during bulk import | ID: {item_id} | Dataset: {item.datasetName} | ManualTags: {item.manual_tags} | Error: {str(e)}"
+                "Tag validation failed during bulk import | ID: %s | Dataset: %s | ManualTags: %s | Error: %s",
+                item_id,
+                item.datasetName,
+                item.manual_tags,
+                e,
             )
 
     return errors
 
 
-async def validate_bulk_items(items: list[GroundTruthItem]) -> dict[str, list[BulkImportError]]:
+async def validate_bulk_items(
+    items: list[AgenticGroundTruthEntry],
+    *,
+    tag_registry_service=None,
+) -> dict[int, list[BulkImportError]]:
     """Validate a list of ground truth items for bulk import.
 
-    Returns a dict mapping item ID to list of structured validation errors.
+    Returns a dict mapping each item's 0-based request index to its list of structured
+    validation errors. Results are keyed by index rather than item.id so that duplicate
+    IDs in one request neither merge per-entry errors nor undercount failed entries.
     Items with no errors are not included in the result.
     """
-    validation_results: dict[str, list[BulkImportError]] = {}
 
-    # Fetch tag registry once for all items with manual tags
+    validation_results: dict[int, list[BulkImportError]] = {}
+
     valid_tags_cache: set[str] | None = None
     has_items_with_tags = any(item.manual_tags for item in items)
     if has_items_with_tags:
-        valid_tags_cache = set(await container.tag_registry_service.list_tags())
+        valid_tags_cache = set(
+            await _resolve_tag_registry_service(tag_registry_service).list_tags()
+        )
 
-    # Validate all items concurrently, passing index to each validator
     validation_tasks = [
-        validate_ground_truth_item(item, index, valid_tags_cache)
+        validate_ground_truth_item(
+            item,
+            index,
+            valid_tags_cache,
+            tag_registry_service=tag_registry_service,
+        )
         for index, item in enumerate(items)
     ]
 
     results = await asyncio.gather(*validation_tasks, return_exceptions=False)
 
-    # Collect errors
-    for item, errors in zip(items, results):
-        if errors:
-            validation_results[item.id] = errors
+    for index, item_errors in enumerate(results):
+        if item_errors:
+            validation_results[index] = item_errors
 
     return validation_results
diff --git a/backend/docs/multi-turn-refs.md b/backend/docs/multi-turn-refs.md
deleted file mode 100644
index d93ee83..0000000
--- a/backend/docs/multi-turn-refs.md
+++ /dev/null
@@ -1,192 +0,0 @@
-# Multi-Turn History with References
-
-## Overview
-
-This document describes the enhancement to the Ground Truth Curator backend to support storing references alongside agent messages in the multi-turn conversation history. This change maintains backward compatibility with the existing top-level `refs` field.
-
-## Changes Made
-
-### 1. Domain Model Updates
-
-**File: `app/domain/models.py`**
-
-Added an optional `refs` field to the `HistoryItem` model to allow storing references with agent responses:
-
-```python
-class HistoryItem(BaseModel):
-    """Represents a single item in the multi-turn history."""
-
-    role: HistoryItemRole  # User or Assistant
-    msg: str
-    refs: Optional[list[Reference]] = None  # References for agent messages
-    tags: list[str] = Field(default_factory=list)  # Optional tags for categorizing history items
-```
-
-### 2. API Endpoint Updates
-
-**Files: `app/api/v1/assignments.py` and `app/api/v1/ground_truths.py`**
-
-Both update endpoints now support receiving and parsing history items with embedded references and tags:
-
-- Added `history` field to `AssignmentUpdateRequest` model
-- Added history parsing logic that:
-  - Converts dict representations to `HistoryItem` models
-  - Parses and validates `refs` within each history item
-  - Parses optional `tags` array within each history item
-  - Supports both `msg` and `content` field names for compatibility
-  - Validates reference structure using the `Reference` model
-
-Example payload handling:
-
-```python
-if "history" in provided_fields and payload.history is not None:
-    history_items = []
-    for h in payload.history:
-        refs_data = h.get("refs")
-                refs_list = None
-                if refs_data is not None:
-                    refs_list = [
-                        r if isinstance(r, Reference) else Reference(**r)
-                        for r in refs_data
-                    ]
-                # Parse tags if present in the history item
-                tags_data = h.get("tags", [])
-                history_items.append(
-                    HistoryItem(
-                        role=h["role"],
-                        msg=h.get("msg") or h.get("content", ""),
-                        refs=refs_list,
-                        tags=tags_data if isinstance(tags_data, list) else [],
-                    )
-                )
-    it.history = history_items
-```
-
-## Backward Compatibility
-
-The changes maintain full backward compatibility:
-
-1. **Top-level `refs` field preserved**: The `GroundTruthItem.refs` field at the top level remains unchanged and continues to work as before.
-
-2. **Optional refs in history**: The `refs` field in `HistoryItem` is optional (defaults to `None`), so existing history items without refs continue to work.
-
-3. **Optional tags in history**: The `tags` field in `HistoryItem` is optional (defaults to an empty list), so existing history items without tags continue to work.
-
-4. **Flexible field names**: The parser supports both `msg` and `content` field names for the message text, accommodating different client implementations.
-
-## Data Structure
-
-### Example with refs and tags in history
-
-```json
-{
-  "id": "example-1",
-  "datasetName": "demo",
-  "synthQuestion": "How do I use this product?",
-  "answer": "It is a powerful CAD tool...", 
-  "refs": [
-    {
-      "url": "https://example.com/doc1",
-      "content": "General documentation"
-    }
-  ],
-  "history": [
-    {
-      "role": "user",
-      "msg": "What is this product?",
-      "tags": ["introduction", "basic-info"]
-    },
-    {
-      "role": "assistant",
-      "msg": "It is a CAD software...",
-      "refs": [
-        {
-          "url": "https://example.com/intro",
-          "content": "Introduction",
-          "keyExcerpt": "It is a comprehensive 3D CAD solution"
-        }
-      ],
-      "tags": ["product-overview"]
-    },
-    {
-      "role": "user",
-      "msg": "How do I install it?",
-      "tags": ["installation", "getting-started"]
-    },
-    {
-      "role": "assistant",
-      "msg": "To install the product, follow these steps...",
-      "refs": [
-        {
-          "url": "https://example.com/install",
-          "content": "Installation guide",
-          "type": "kb"
-        }
-      ],
-      "tags": ["installation", "step-by-step"]
-    }
-  ]
-}
-```
-
-## Frontend Integration
-
-The frontend already has support for tracking which agent turn references belong to via the `turnIndex` field on the `Reference` type. This backend change complements that by allowing the references to be stored directly with the history items.
-
-### Mapping between frontend and backend
-
-- **Frontend**: Uses `turnIndex` on references to indicate which turn they belong to
-- **Backend**: Stores references directly within the `HistoryItem` that generated them
-
-When syncing between frontend and backend:
-
-1. Frontend can group references by `turnIndex` and include them in the corresponding history item when saving
-2. Backend stores these references with the history item
-3. When loading, backend returns history items with embedded refs
-4. Frontend can extract refs and set the appropriate `turnIndex` based on the history position
-
-## Testing
-
-A comprehensive test suite has been added in `tests/unit/test_history_with_refs.py` that validates:
-
-- Creating history items with refs
-- Creating history items without refs (optional field)
-- Serialization (model_dump)
-- Deserialization (from dict)
-- Both user and agent messages with/without refs
-
-All tests pass successfully.
-
-## Use Cases
-
-This enhancement enables several important use cases:
-
-1. **Multi-turn conversations**: Each agent response can include its own set of references, making it clear which sources were used for each part of the conversation.
-
-2. **Reference tracking**: Track exactly which documents informed each response in a multi-turn dialogue.
-
-3. **Categorization and filtering**: Use tags to categorize history items by topic, intent, or any other custom dimension. This enables filtering and analysis of conversations by tag.
-
-4. **Evaluation**: Enable more precise evaluation of multi-turn conversations by associating references and tags with specific turns.
-
-5. **Transparency**: Provide users with clear attribution for each agent response in a conversation.
-
-6. **Content organization**: Tags can be used to mark special types of interactions (e.g., "clarification", "follow-up", "technical", "non-technical", "escalation").
-
-## Migration Notes
-
-No migration is required for existing data:
-
-- Existing items without history continue to work
-- Existing items with history but no refs in history items continue to work
-- The top-level `refs` field remains the primary location for references in single-turn items
-
-## Future Enhancements
-
-Potential future improvements:
-
-1. **Automatic reference aggregation**: Logic to automatically collect all refs from history items and merge them into the top-level `refs` field for backward compatibility with consumers that only read the top-level field.
-
-2. **Reference deduplication**: Logic to deduplicate references across multiple turns in a conversation.
-
-3. **Turn-specific reference validation**: Validate that references are appropriate for the specific turn they're associated with.
diff --git a/backend/environments/integration-tests.env b/backend/environments/integration-tests.env
index 57c7690..feab127 100644
--- a/backend/environments/integration-tests.env
+++ b/backend/environments/integration-tests.env
@@ -7,8 +7,7 @@
 # tags or reinitialize repos during app lifespan.
 GTC_COSMOS_TEST_MODE=true
 
-# Keep chat and search disabled unless explicitly configured in local.env or CI.
-GTC_CHAT_ENABLED=false
+# Keep search disabled unless explicitly configured in local.env or CI.
 
 # Avoid Easy Auth in integration tests unless explicitly enabled.
 GTC_EZAUTH_ENABLED=false
diff --git a/backend/environments/sample.env b/backend/environments/sample.env
index e991e40..2a421db 100644
--- a/backend/environments/sample.env
+++ b/backend/environments/sample.env
@@ -51,6 +51,8 @@ GTC_COSMOS_TEST_MODE=false
 # --- Optional: Telemetry (Azure Monitor / App Insights) ---
 # Keep disabled in the committed sample. Enable in local.env if needed.
 GTC_AZ_MONITOR_ENABLED=false
+# Enable local harness JSONL request/trace mirrors when running the repo harness.
+GTC_HARNESS_JSONL_ENABLED=false
 # GTC_AZ_MONITOR_CONNECTION_STRING=InstrumentationKey=...;IngestionEndpoint=...
 # Or use the standard Azure var name:
 # APPLICATIONINSIGHTS_CONNECTION_STRING=InstrumentationKey=...;IngestionEndpoint=...
@@ -61,11 +63,3 @@ GTC_EZAUTH_ALLOW_ANONYMOUS_PATHS=/healthz,/metrics
 # GTC_EZAUTH_ALLOWED_EMAIL_DOMAINS=example.com
 # GTC_EZAUTH_ALLOWED_OBJECT_IDS=00000000-0000-0000-0000-000000000000
 GTC_EZAUTH_HEADER_SOURCE=aca
-
-# --- Optional: Agent chat / retrieval ---
-# Disable by default so local runs do not require Foundry/retrieval configuration.
-GTC_CHAT_ENABLED=false
-# GTC_AZURE_AI_PROJECT_ENDPOINT=https://<your-ai-services>.services.ai.azure.com/api/projects/<project>
-# GTC_AZURE_AI_AGENT_ID=<agent-id>
-# GTC_RETRIEVAL_URL=https://<your-retrieval-service>/api/retrieve-default
-# GTC_RETRIEVAL_PERMISSIONS_SCOPE=api://<app-id>/.default
diff --git a/backend/tests/integration/conftest.py b/backend/tests/integration/conftest.py
index 46e1c80..000a315 100644
--- a/backend/tests/integration/conftest.py
+++ b/backend/tests/integration/conftest.py
@@ -261,22 +261,9 @@ def configure_ezauth_for_tests():
 
 def pytest_collection_modifyitems(config: pytest.Config, items: list[pytest.Item]) -> None:
     """Apply default skips for integration tests that require external services."""
-    chat_configured = bool(
-        settings.AZURE_AI_PROJECT_ENDPOINT and settings.AZURE_AI_AGENT_ID and settings.CHAT_ENABLED
-    )
     search_configured = bool(settings.AZ_SEARCH_ENDPOINT and settings.AZ_SEARCH_INDEX)
 
     for item in items:
-        if item.get_closest_marker("requires_chat") and not chat_configured:
-            item.add_marker(
-                pytest.mark.skip(
-                    reason=(
-                        "Azure AI Foundry agent not configured; set "
-                        "GTC_AZURE_AI_PROJECT_ENDPOINT, GTC_AZURE_AI_AGENT_ID, "
-                        "and GTC_CHAT_ENABLED=true"
-                    )
-                )
-            )
         if item.get_closest_marker("requires_search") and not search_configured:
             item.add_marker(
                 pytest.mark.skip(
diff --git a/backend/tests/integration/test_assignments_cosmos.py b/backend/tests/integration/test_assignments_cosmos.py
index 73ae1c4..25dde5d 100644
--- a/backend/tests/integration/test_assignments_cosmos.py
+++ b/backend/tests/integration/test_assignments_cosmos.py
@@ -3,7 +3,7 @@
 import pytest
 import uuid
 
-from app.domain.models import GroundTruthItem
+from app.domain.models import AgenticGroundTruthEntry
 from app.container import container
 from app.adapters.repos.cosmos_repo import CosmosGroundTruthRepo
 
@@ -57,7 +57,7 @@ async def test_assigned_ground_truths_update_and_approve(async_client: AsyncClie
     assert r.status_code == 200
     data: dict = r.json()
     # mypy: data.get returns Optional[Any]; use default [] to ensure list type
-    adocs = TypeAdapter(list[GroundTruthItem]).validate_python(data.get("assigned") or [])
+    adocs = TypeAdapter(list[AgenticGroundTruthEntry]).validate_python(data.get("assigned") or [])
     assert adocs and len(adocs) >= 1
     gt_id = adocs[0].id
 
@@ -103,7 +103,7 @@ async def assigned_ground_truth(async_client: AsyncClient, user_headers):
     )
     assert r.status_code == 200
     data = r.json()
-    adocs = TypeAdapter(list[GroundTruthItem]).validate_python(data.get("assigned") or [])
+    adocs = TypeAdapter(list[AgenticGroundTruthEntry]).validate_python(data.get("assigned") or [])
     assert adocs and len(adocs) >= 1
     gt = adocs[0]
 
diff --git a/backend/tests/integration/test_assignments_edited_question_persist_cosmos.py b/backend/tests/integration/test_assignments_edited_question_persist_cosmos.py
index 4f42720..e339fe9 100644
--- a/backend/tests/integration/test_assignments_edited_question_persist_cosmos.py
+++ b/backend/tests/integration/test_assignments_edited_question_persist_cosmos.py
@@ -31,12 +31,17 @@ def make_item(dataset: str, item_id: str) -> dict[str, Any]:
 async def test_assignments_put_persists_edited_question_camel_case(
     async_client: AsyncClient, user_headers: dict[str, str]
 ):
-    """Ensure that providing editedQuestion (camelCase) in the assignments PUT body
-    updates and persists the field (alias -> edited_question) and is reflected on
-    subsequent GET/list responses.
+    """Compat-migration coverage for the temporary editedQuestion alias path.
 
-    Regression coverage for previous bug where only snake_case 'edited_question'
-    was checked, causing the field to be dropped.
+    This test stays only while assignments updates still project legacy camelCase
+    question fields across the compatibility boundary. Delete it with the alias
+    retirement work in the hard-delete phase.
+
+    **Phase 5 Audit (2026-03-12)**: MIGRATION TEST - INFORMATIONAL
+    This test validates that editedQuestion persists correctly through Cosmos
+    round-trips. The test is marked as temporary and should be deleted when
+    Phase 6 removes legacy field support. Not a delete blocker, but it documents
+    the current persistence contract.
     """
     dataset = f"editedq-{uuid4().hex[:6]}"
     item_id = "gt-1"
diff --git a/backend/tests/integration/test_assignments_flow_cosmos.py b/backend/tests/integration/test_assignments_flow_cosmos.py
index 9ebf13d..785a2c2 100644
--- a/backend/tests/integration/test_assignments_flow_cosmos.py
+++ b/backend/tests/integration/test_assignments_flow_cosmos.py
@@ -7,7 +7,7 @@
 import pytest
 from httpx import AsyncClient
 
-from app.domain.models import GroundTruthItem
+from app.domain.models import AgenticGroundTruthEntry
 
 
 def make_item(dataset: str) -> dict[str, Any]:
@@ -41,13 +41,15 @@ async def test_self_serve_list_and_approve(async_client: AsyncClient, user_heade
     assert r.status_code == 200
     resp = cast(dict[str, Any], r.json())
     assert resp.get("assignedCount") == 2
-    assigned = TypeAdapter(list[GroundTruthItem]).validate_python(resp.get("assigned") or [])
+    assigned = TypeAdapter(list[AgenticGroundTruthEntry]).validate_python(
+        resp.get("assigned") or []
+    )
     assert len(assigned) == 2
 
     # List my assignments
     r = await async_client.get("/v1/assignments/my", headers=user_headers)
     assert r.status_code == 200
-    docs = TypeAdapter(list[GroundTruthItem]).validate_python(r.json())
+    docs = TypeAdapter(list[AgenticGroundTruthEntry]).validate_python(r.json())
     assert len(docs) == 2
 
     # Approve first via assignments PUT
diff --git a/backend/tests/integration/test_assignments_retry_exclusion.py b/backend/tests/integration/test_assignments_retry_exclusion.py
index fa3cf47..5232ae9 100644
--- a/backend/tests/integration/test_assignments_retry_exclusion.py
+++ b/backend/tests/integration/test_assignments_retry_exclusion.py
@@ -15,7 +15,7 @@
 from pydantic.type_adapter import TypeAdapter
 import pytest
 
-from app.domain.models import GroundTruthItem
+from app.domain.models import AgenticGroundTruthEntry
 from app.adapters.repos.cosmos_repo import CosmosGroundTruthRepo
 
 
@@ -110,7 +110,9 @@ async def test_skipped_items_excluded_from_user_resampling(
         "/v1/assignments/self-serve", json={"limit": 2}, headers=user_headers
     )
     assert r.status_code == 200
-    first_batch = TypeAdapter(list[GroundTruthItem]).validate_python(r.json().get("assigned") or [])
+    first_batch = TypeAdapter(list[AgenticGroundTruthEntry]).validate_python(
+        r.json().get("assigned") or []
+    )
     assert len(first_batch) == 2
 
     # Skip one item
@@ -127,7 +129,7 @@ async def test_skipped_items_excluded_from_user_resampling(
         "/v1/assignments/self-serve", json={"limit": 3}, headers=user_headers
     )
     assert r.status_code == 200
-    second_batch = TypeAdapter(list[GroundTruthItem]).validate_python(
+    second_batch = TypeAdapter(list[AgenticGroundTruthEntry]).validate_python(
         r.json().get("assigned") or []
     )
     assert len(second_batch) == 3
diff --git a/backend/tests/integration/test_assignments_skipped_reassign_cosmos.py b/backend/tests/integration/test_assignments_skipped_reassign_cosmos.py
index d3e59b2..a19faf9 100644
--- a/backend/tests/integration/test_assignments_skipped_reassign_cosmos.py
+++ b/backend/tests/integration/test_assignments_skipped_reassign_cosmos.py
@@ -8,7 +8,7 @@
 from pydantic.type_adapter import TypeAdapter
 import pytest
 
-from app.domain.models import GroundTruthItem
+from app.domain.models import AgenticGroundTruthEntry
 
 
 def make_skipped_item(dataset: str, assigned_to: str) -> dict[str, Any]:
@@ -48,7 +48,7 @@ async def test_self_serve_reassigns_skipped_and_lists_in_my(
     payload = cast(dict[str, Any], r.json())
     assert payload.get("assignedCount") == 1
 
-    assigned_items = TypeAdapter(list[GroundTruthItem]).validate_python(
+    assigned_items = TypeAdapter(list[AgenticGroundTruthEntry]).validate_python(
         payload.get("assigned") or []
     )
     assert len(assigned_items) == 1
@@ -63,7 +63,7 @@ async def test_self_serve_reassigns_skipped_and_lists_in_my(
     # /my should list the item now (since it filters by assignedTo == user and status == draft)
     r = await async_client.get("/v1/assignments/my", headers=user_headers)
     assert r.status_code == 200
-    my_items = TypeAdapter(list[GroundTruthItem]).validate_python(r.json())
+    my_items = TypeAdapter(list[AgenticGroundTruthEntry]).validate_python(r.json())
     assert len(my_items) == 1
     assert my_items[0].id == gt.id
     assert my_items[0].assignedTo == expected_user
diff --git a/backend/tests/integration/test_chat_agent_live.py b/backend/tests/integration/test_chat_agent_live.py
deleted file mode 100644
index 56e9e11..0000000
--- a/backend/tests/integration/test_chat_agent_live.py
+++ /dev/null
@@ -1,139 +0,0 @@
-from __future__ import annotations
-
-from typing import cast
-
-import pytest
-from httpx import AsyncClient
-from httpx import ASGITransport
-
-from app.core.config import settings
-from app.main import create_app
-
-
-# Override the autouse Cosmos cleanup from tests/integration/conftest.py so this
-# test does not require a running Cosmos Emulator just to exercise the agent.
-@pytest.fixture(scope="function", autouse=True)
-async def clear_cosmos_db():
-    yield
-
-
-# Provide a local async_client that doesn't depend on the Cosmos-backed app fixture.
-@pytest.fixture(scope="function")
-async def async_client():
-    app = create_app()
-    transport = ASGITransport(app=app)
-    client = AsyncClient(transport=transport, base_url="http://testserver")
-    try:
-        yield client
-    finally:
-        try:
-            await client.aclose()
-        finally:
-            try:
-                await transport.aclose()
-            except Exception:
-                pass
-
-
-@pytest.mark.no_seed_tags
-@pytest.mark.requires_chat
-@pytest.mark.anyio
-async def test_chat_endpoint_hits_azure_ai_agent(
-    async_client: AsyncClient, user_headers: dict[str, str]
-):
-    """End-to-end call that hits the configured Azure AI Foundry Agent Service.
-
-    This test will be skipped unless the integration environment provides
-    valid Azure AI Foundry agent settings and the chat feature is enabled.
-
-    Tests that the agent can respond to questions and validates the response structure.
-
-    CITATIONS/REFERENCES:
-    For the agent to return citations, it must be configured in Azure AI Foundry with
-    one or more search/grounding tools:
-
-    1. Azure AI Search:
-       - Add the "azure_ai_search" tool to the agent
-       - Configure connection to an Azure AI Search index
-       - Requires the search index to be populated with documents
-
-    2. File Search:
-       - Add the "file_search" tool to the agent
-       - Upload files to a vector store
-       - Attach the vector store to the agent
-
-    3. Bing Grounding:
-       - Enable Bing grounding in the agent configuration
-       - Requires Bing Search API resource
-
-    Without any of these tools, the agent will respond from its training data
-    without citations. The test validates the response structure but does not
-    require citations to pass.
-    """
-    # Ensure Azure AI Foundry agent is configured; otherwise skip gracefully.
-    if not (
-        settings.AZURE_AI_PROJECT_ENDPOINT and settings.AZURE_AI_AGENT_ID and settings.CHAT_ENABLED
-    ):
-        pytest.skip(
-            "Azure AI Foundry agent not configured; set GTC_AZURE_AI_PROJECT_ENDPOINT, "
-            "GTC_AZURE_AI_AGENT_ID, and GTC_CHAT_ENABLED=true"
-        )
-
-    # Use a question that would likely trigger a search if the agent has search tools
-    # Ask about a specific technical topic that should be in the indexed documents
-    r = await async_client.post(
-        "/v1/chat",
-        json={
-            "message": "What are the key features and capabilities of the product? Please provide specific details from the documentation.",
-        },
-        headers=user_headers,
-    )
-
-    # If Azure AI credentials are invalid or the service is unreachable, the API maps
-    # adapter errors to 502 or 503. Treat that as a skip so CI without creds doesn't fail.
-    if r.status_code in (502, 503):
-        pytest.skip("Azure AI Foundry agent not reachable or unauthorized; skipping live test")
-
-    assert r.status_code == 200
-    body = cast(dict[str, object], r.json())
-
-    # Verify response structure
-    assert "content" in body, "Response should have 'content' field"
-    assert isinstance(body["content"], str), "Content should be a string"
-    assert len(body["content"]) > 0, "Content should not be empty"
-
-    # Sanity bound: ensure content isn't outrageously large
-    assert len(cast(str, body["content"])) <= 50000, "Content should be under 50k characters"
-
-    # Verify references field exists
-    assert "references" in body, "Response should have 'references' field"
-    assert isinstance(body["references"], list), "References should be a list"
-
-    # Log reference count for debugging
-    ref_count = len(body["references"])
-    print(f"\n✓ Agent returned {ref_count} reference(s)")
-
-    # If there are references, validate their structure
-    if ref_count > 0:
-        print("✓ Agent has search tools enabled and returned citations")
-        for i, ref in enumerate(body["references"]):
-            assert isinstance(ref, dict), f"Reference {i} should be a dict"
-            # Validate reference has expected fields
-            assert "id" in ref or "url" in ref, f"Reference {i} should have either 'id' or 'url'"
-            if ref.get("snippet"):
-                print(f"  Reference {i + 1}: {ref.get('snippet', '')[:100]}...")
-            elif ref.get("url"):
-                print(f"  Reference {i + 1}: {ref.get('url')}")
-    else:
-        # No references - this could mean:
-        # 1. Agent doesn't have search tools enabled
-        # 2. Agent chose not to use search for this query
-        # 3. Search didn't return results
-        print("⚠ Warning: No references returned. This may indicate:")
-        print("  - Agent doesn't have search tools (Azure AI Search, file_search, or Bing) enabled")
-        print("  - Agent chose not to search for this particular query")
-        print("  - Search tools didn't return any relevant documents")
-        print(f"\nAgent response preview: {body['content'][:200]}...")
-
-        # Don't fail the test, but print a warning
-        # In a real scenario, you'd want to verify with Azure portal that search tools are configured
diff --git a/backend/tests/integration/test_ground_truths_cosmos.py b/backend/tests/integration/test_ground_truths_cosmos.py
index 469ae99..f3ac5c6 100644
--- a/backend/tests/integration/test_ground_truths_cosmos.py
+++ b/backend/tests/integration/test_ground_truths_cosmos.py
@@ -69,6 +69,38 @@ async def test_update_with_etag(async_client: AsyncClient, user_headers):
     assert res["status"] == GroundTruthStatus.approved.value
 
 
+@pytest.mark.anyio
+async def test_update_to_approved_sets_review_metadata(async_client: AsyncClient, user_headers):
+    dataset = "test-ds-approve-metadata"
+    item = make_item(dataset)
+    item["history"] = [
+        {"role": "user", "msg": "What is the capital of France?"},
+        {"role": "assistant", "msg": "Paris."},
+    ]
+
+    r = await async_client.post("/v1/ground-truths", json=[item], headers=user_headers)
+    assert r.status_code == 200
+
+    r = await async_client.get(f"/v1/ground-truths/{dataset}", headers=user_headers)
+    assert r.status_code == 200
+    data = r.json()
+    etag = data[0]["_etag"]
+    bucket = data[0]["bucket"]
+
+    headers = dict(user_headers)
+    headers.update({"If-Match": etag})
+    r = await async_client.put(
+        f"/v1/ground-truths/{dataset}/{bucket}/{item['id']}",
+        json={"status": "approved"},
+        headers=headers,
+    )
+    assert r.status_code == 200
+    body = r.json()
+    assert body["status"] == GroundTruthStatus.approved.value
+    assert body["reviewedAt"] is not None
+    assert body["updatedBy"] == "tester@example.com"
+
+
 @pytest.mark.anyio
 async def test_delete_item_and_dataset(async_client: AsyncClient, user_headers):
     dataset = "test-ds-delete"
@@ -132,18 +164,39 @@ async def test_snapshot_and_stats(async_client: AsyncClient, user_headers):
 @pytest.mark.anyio
 async def test_import_with_approve_flag(async_client: AsyncClient, user_headers):
     dataset = "test-approve-on-import"
-    item = make_item(dataset)
-    # Import with approve=true so items are automatically approved
-    r = await async_client.post("/v1/ground-truths?approve=true", json=[item], headers=user_headers)
+
+    # Item WITHOUT history: approval validation should reject it
+    invalid_item = make_item(dataset)
+    r = await async_client.post(
+        "/v1/ground-truths?approve=true", json=[invalid_item], headers=user_headers
+    )
+    assert r.status_code == 200
+    data = r.json()
+    # Approval-invalid items must NOT be imported
+    assert data.get("imported") == 0
+    assert data.get("failed") == 1
+    assert any(e.get("code") == "APPROVAL_VALIDATION_FAILED" for e in data.get("errors", []))
+
+    # Item WITH history: approval validation should accept it
+    valid_item = make_item(dataset)
+    valid_item["history"] = [
+        {"role": "user", "msg": "What is the capital of France?"},
+        {"role": "assistant", "msg": "The capital of France is Paris."},
+    ]
+    r = await async_client.post(
+        "/v1/ground-truths?approve=true", json=[valid_item], headers=user_headers
+    )
     assert r.status_code == 200
     data = r.json()
     assert data.get("imported") == 1
+    assert data.get("failed") == 0
 
-    # Verify the item is approved on read
+    # Verify the valid item is approved on read
     r = await async_client.get(f"/v1/ground-truths/{dataset}", headers=user_headers)
     assert r.status_code == 200
     lst = r.json()
-    assert lst and lst[0]["status"] == GroundTruthStatus.approved.value
+    approved = [i for i in lst if i["id"] == valid_item["id"]]
+    assert approved and approved[0]["status"] == GroundTruthStatus.approved.value
 
     # Stats should include at least one approved
     r = await async_client.get("/v1/ground-truths/stats", headers=user_headers)
diff --git a/backend/tests/integration/test_ground_truths_get_and_filters_cosmos.py b/backend/tests/integration/test_ground_truths_get_and_filters_cosmos.py
index 3339841..3e9712f 100644
--- a/backend/tests/integration/test_ground_truths_get_and_filters_cosmos.py
+++ b/backend/tests/integration/test_ground_truths_get_and_filters_cosmos.py
@@ -7,7 +7,7 @@
 from uuid import uuid4
 from httpx import AsyncClient
 
-from app.domain.models import GroundTruthItem
+from app.domain.models import AgenticGroundTruthEntry
 
 
 def make_item(dataset: str, *, gid: Optional[str] = None) -> dict[str, Any]:
@@ -50,7 +50,7 @@ async def test_get_item_200_and_404(async_client: AsyncClient, user_headers: dic
         f"/v1/ground-truths/{dataset}/{bucket}/gt-200", headers=user_headers
     )
     assert res.status_code == 200
-    gt_item = TypeAdapter(GroundTruthItem).validate_python(res.json())
+    gt_item = TypeAdapter(AgenticGroundTruthEntry).validate_python(res.json())
     assert gt_item.id == "gt-200"
     assert gt_item.etag
 
diff --git a/backend/tests/integration/test_inference_service_live.py b/backend/tests/integration/test_inference_service_live.py
deleted file mode 100644
index fd80c95..0000000
--- a/backend/tests/integration/test_inference_service_live.py
+++ /dev/null
@@ -1,314 +0,0 @@
-"""
-Integration tests for GTCInferenceAdapter with live Azure AI Foundry Agent.
-
-Tests the agent calling flow with FunctionTool-based retrieval.
-These tests hit real Azure AI services and require proper configuration.
-
-Prerequisites:
-- GTC_AZURE_AI_PROJECT_ENDPOINT: Azure AI Project endpoint
-- GTC_AZURE_AI_AGENT_ID: Agent ID configured with call_retrieval_tool function
-- GTC_RETRIEVAL_URL: Retrieval service URL
-- GTC_RETRIEVAL_PERMISSIONS_SCOPE: OAuth scope for retrieval service
-- GTC_CHAT_ENABLED=true
-
-The agent must be configured in Azure AI Foundry with:
-- A function tool named "call_retrieval_tool" with parameters:
-  - query: string - The search query text
-  - code: string - Code identifier for the search configuration
-"""
-
-from __future__ import annotations
-
-import os
-
-import pytest
-
-from app.core.config import settings
-from app.adapters.gtc_inference_adapter import GTCInferenceAdapter
-
-
-_RUN_LIVE_TESTS = os.getenv("GTC_RUN_LIVE_AZURE_TESTS", "").strip().lower() in {"1", "true", "yes"}
-
-pytestmark = pytest.mark.skipif(
-    not _RUN_LIVE_TESTS,
-    reason=(
-        "Live Azure AI Foundry tests are disabled by default; set "
-        "GTC_RUN_LIVE_AZURE_TESTS=1 to enable."
-    ),
-)
-
-
-# Override the autouse Cosmos cleanup from tests/integration/conftest.py so this
-# test does not require a running Cosmos Emulator.
-@pytest.fixture(scope="function", autouse=True)
-async def clear_cosmos_db():
-    yield
-
-
-@pytest.fixture(scope="function")
-def inference_service():
-    """Create GTCInferenceAdapter with live configuration.
-
-    Skips if required configuration is not available.
-    """
-    if not settings.AZURE_AI_PROJECT_ENDPOINT:
-        pytest.skip("GTC_AZURE_AI_PROJECT_ENDPOINT not configured")
-    if not settings.AZURE_AI_AGENT_ID:
-        pytest.skip("GTC_AZURE_AI_AGENT_ID not configured")
-    if not settings.CHAT_ENABLED:
-        pytest.skip("GTC_CHAT_ENABLED is not true")
-    if not settings.RETRIEVAL_URL:
-        pytest.skip("GTC_RETRIEVAL_URL not configured")
-    if not settings.RETRIEVAL_PERMISSIONS_SCOPE:
-        pytest.skip("GTC_RETRIEVAL_PERMISSIONS_SCOPE not configured")
-
-    service = GTCInferenceAdapter(
-        project_endpoint=settings.AZURE_AI_PROJECT_ENDPOINT,
-        agent_id=settings.AZURE_AI_AGENT_ID,
-        retrieval_url=settings.RETRIEVAL_URL,
-        permissions_scope=settings.RETRIEVAL_PERMISSIONS_SCOPE,
-        timeout_seconds=settings.RETRIEVAL_TIMEOUT_SECONDS,
-    )
-
-    try:
-        yield service
-    finally:
-        service.close()
-
-
-@pytest.mark.no_seed_tags
-@pytest.mark.anyio
-async def test_inference_service_generate_calls_agent(inference_service: GTCInferenceAdapter):
-    """Test that GTCInferenceAdapter.generate() calls the Azure AI agent and returns a response.
-
-    This test validates:
-    1. The agent is reachable and responds
-    2. The response contains 'content' (assistant reply)
-    3. The response contains 'references' (list, possibly empty)
-
-    The agent should use the FunctionTool (call_retrieval_tool) if configured,
-    but the test doesn't require references to pass.
-    """
-    import asyncio
-
-    # Run the sync generate method in a thread pool (as would be done in async context)
-    result = await asyncio.to_thread(
-        inference_service.generate,
-        user_id="test-user",
-        message="What are the main features of this product? Please search for relevant documentation.",
-    )
-
-    # Verify response structure
-    assert "content" in result, "Response should have 'content' field"
-    assert isinstance(result["content"], str), "Content should be a string"
-    assert len(result["content"]) > 0, "Content should not be empty"
-
-    # Verify references field
-    assert "references" in result, "Response should have 'references' field"
-    assert isinstance(result["references"], list), "References should be a list"
-
-    # Log for debugging
-    print(f"\n✓ Agent returned response with {len(result['references'])} reference(s)")
-    print(f"  Content length: {len(result['content'])} characters")
-
-    if result["references"]:
-        print("✓ Agent used retrieval tool and returned citations")
-        for i, ref in enumerate(result["references"][:3]):  # Show first 3
-            ref_id = ref.get("id", "no-id")
-            snippet = ref.get("snippet", "")[:80] if ref.get("snippet") else ""
-            print(f"  Reference {i + 1}: {ref_id} - {snippet}...")
-    else:
-        print("⚠ No references returned. Agent may not have used retrieval tool.")
-        print(f"  Response preview: {result['content'][:200]}...")
-
-
-@pytest.mark.no_seed_tags
-@pytest.mark.anyio
-async def test_inference_service_generate_with_technical_question(
-    inference_service: GTCInferenceAdapter,
-):
-    """Test agent response to a specific technical question that should trigger retrieval.
-
-    Uses a question that would require searching documentation to answer properly.
-    """
-    import asyncio
-
-    result = await asyncio.to_thread(
-        inference_service.generate,
-        user_id="test-user-tech",
-        message="How do I configure assembly constraints in a CAD application? What are the different types of constraints available?",
-    )
-
-    # Basic structure validation
-    assert "content" in result
-    assert isinstance(result["content"], str)
-    assert len(result["content"]) > 50, "Technical answer should be substantial"
-
-    assert "references" in result
-    assert isinstance(result["references"], list)
-
-    # For technical questions, we'd expect references if the retrieval is working
-    ref_count = len(result["references"])
-    print(f"\n✓ Technical question answered with {ref_count} reference(s)")
-
-    if ref_count > 0:
-        # Validate reference structure
-        for ref in result["references"]:
-            # Each reference should have at least an id or url
-            assert ref.get("id") or ref.get("url"), "Reference should have id or url"
-
-
-@pytest.mark.no_seed_tags
-@pytest.mark.anyio
-async def test_inference_service_handles_empty_message_gracefully(
-    inference_service: GTCInferenceAdapter,
-):
-    """Test that the service handles edge cases gracefully.
-
-    Note: The agent may still respond to a simple greeting or very short message,
-    but it should not crash.
-    """
-    import asyncio
-
-    result = await asyncio.to_thread(
-        inference_service.generate, user_id="test-user-edge", message="Hi"
-    )
-
-    # Should still return a valid response structure
-    assert "content" in result
-    assert "references" in result
-    print(f"\n✓ Simple greeting handled, response: {result['content'][:100]}...")
-
-
-@pytest.mark.no_seed_tags
-@pytest.mark.anyio
-async def test_inference_service_multiple_sequential_calls(inference_service: GTCInferenceAdapter):
-    """Test that the service handles multiple sequential calls correctly.
-
-    Each call should create a new thread and clean up properly.
-    """
-    import asyncio
-
-    questions = [
-        "What is the main product feature?",
-        "How does version control work in the application?",
-    ]
-
-    for i, question in enumerate(questions):
-        result = await asyncio.to_thread(
-            inference_service.generate, user_id=f"test-user-multi-{i}", message=question
-        )
-
-        assert "content" in result
-        assert "references" in result
-        print(f"\n✓ Question {i + 1}/{len(questions)} answered successfully")
-
-
-@pytest.mark.no_seed_tags
-@pytest.mark.anyio
-async def test_inference_service_response_content_bounds(inference_service: GTCInferenceAdapter):
-    """Test that response content is within reasonable bounds.
-
-    Validates:
-    - Content is not empty
-    - Content is not excessively large (< 50KB)
-    - References are capped at reasonable limits
-    """
-    import asyncio
-
-    result = await asyncio.to_thread(
-        inference_service.generate,
-        user_id="test-user-bounds",
-        message="Provide a comprehensive overview of the product's design capabilities.",
-    )
-
-    # Content bounds
-    content = result["content"]
-    assert len(content) > 0, "Content should not be empty"
-    assert len(content) <= 50000, "Content should be under 50KB"
-
-    # Reference bounds (service should cap at MAX_RESULTS=100)
-    refs = result["references"]
-    assert len(refs) <= 100, "References should be capped at 100"
-
-    print(f"\n✓ Response within bounds: {len(content)} chars, {len(refs)} refs")
-
-
-class TestInferenceServiceRetrievalIntegration:
-    """Integration tests specifically for the retrieval tool functionality."""
-
-    @pytest.mark.no_seed_tags
-    @pytest.mark.anyio
-    async def test_retrieval_tool_is_called_for_search_query(
-        self, inference_service: GTCInferenceAdapter
-    ):
-        """Test that asking a question triggers the retrieval tool.
-
-        We can't directly verify the tool was called, but we can check:
-        1. Response mentions relevant technical content
-        2. References are returned (if retrieval worked)
-        """
-        import asyncio
-
-        result = await asyncio.to_thread(
-            inference_service.generate,
-            user_id="test-retrieval",
-            message="Search the documentation for information about CAD file formats supported by this product.",
-        )
-
-        assert "content" in result
-        # The response should mention something technical if retrieval worked
-        content_lower = result["content"].lower()
-
-        # Log what we got
-        print(f"\n✓ Response: {result['content'][:300]}...")
-        print(f"  References: {len(result['references'])}")
-
-        # If references are returned, the retrieval tool was used
-        if result["references"]:
-            print("✓ Retrieval tool was invoked and returned results")
-            # Validate reference structure
-            for ref in result["references"]:
-                assert isinstance(ref, dict), "Each reference should be a dict"
-                # Should have mapped fields
-                if ref.get("snippet"):
-                    assert isinstance(ref["snippet"], str)
-                    # Snippet should be truncated to MAX_STRING_LENGTH (1000)
-                    assert len(ref["snippet"]) <= 1003  # 1000 + "..."
-
-    @pytest.mark.no_seed_tags
-    @pytest.mark.anyio
-    async def test_references_have_expected_structure(self, inference_service: GTCInferenceAdapter):
-        """Test that returned references have the expected structure.
-
-        Expected fields from extract_references_from_output:
-        - id: from chunk_id or id
-        - title: document title
-        - url: document URL
-        - snippet: truncated content
-        """
-        import asyncio
-
-        result = await asyncio.to_thread(
-            inference_service.generate,
-            user_id="test-ref-structure",
-            message="What are the system requirements for installing this product?",
-        )
-
-        if not result["references"]:
-            pytest.skip("No references returned; cannot validate structure")
-
-        print(f"\n✓ Validating {len(result['references'])} reference(s)")
-
-        for i, ref in enumerate(result["references"]):
-            assert isinstance(ref, dict), f"Reference {i} should be a dict"
-
-            # Should have some identifier
-            has_id = ref.get("id") is not None
-            has_url = ref.get("url") is not None
-            assert has_id or has_url, f"Reference {i} should have id or url"
-
-            # Log reference details
-            print(
-                f"  Ref {i + 1}: id={ref.get('id')}, title={ref.get('title')}, url={ref.get('url')}"
-            )
diff --git a/backend/tests/test_helpers.py b/backend/tests/test_helpers.py
new file mode 100644
index 0000000..d725ab7
--- /dev/null
+++ b/backend/tests/test_helpers.py
@@ -0,0 +1,182 @@
+"""Test helpers for creating AgenticGroundTruthEntry fixtures.
+
+After Phase 6: canonical state is history[]; question/answer/refs are derived
+from history or stored in plugins["rag-compat"].
+"""
+
+from __future__ import annotations
+
+from datetime import datetime
+from typing import Any
+
+from app.domain.enums import GroundTruthStatus
+from app.domain.models import AgenticGroundTruthEntry
+
+
+def make_test_entry(
+    *,
+    id: str = "test-item",
+    dataset_name: str = "test-dataset",
+    status: GroundTruthStatus = GroundTruthStatus.draft,
+    history: list[dict[str, Any]] | None = None,
+    synth_question: str | None = None,
+    edited_question: str | None = None,
+    answer: str | None = None,
+    refs: list[dict[str, Any]] | None = None,
+    manual_tags: list[str] | None = None,
+    comment: str = "",
+    reviewed_at: datetime | None = None,
+    **kwargs: Any,
+) -> AgenticGroundTruthEntry:
+    """Create a test entry with canonical history-based construction.
+
+    Args:
+        id: Item ID (default: "test-item")
+        dataset_name: Dataset name (default: "test-dataset")
+        status: Item status (default: draft)
+        history: Explicit history array. If None and synth_question/answer are provided,
+                 a simple Q&A history will be auto-generated.
+        synth_question: Question text (stored in rag-compat plugin)
+        edited_question: Edited question text (stored in rag-compat plugin)
+        answer: Answer text (stored in rag-compat plugin)
+        refs: References (stored in rag-compat plugin)
+        manual_tags: Manual tags list
+        comment: Item comment
+        reviewed_at: Review timestamp
+        **kwargs: Additional fields to pass to AgenticGroundTruthEntry
+
+    Returns:
+        AgenticGroundTruthEntry: A properly constructed test entry
+
+    Examples:
+        # Simple Q&A entry:
+        entry = make_test_entry(
+            id="item-1",
+            synth_question="What is X?",
+            answer="X is Y"
+        )
+
+        # Entry with explicit history:
+        entry = make_test_entry(
+            id="item-2",
+            history=[
+                {"role": "user", "msg": "Hello"},
+                {"role": "assistant", "msg": "Hi there"}
+            ]
+        )
+
+        # Entry with refs in rag-compat plugin:
+        entry = make_test_entry(
+            id="item-3",
+            synth_question="What?",
+            answer="Answer",
+            refs=[{"url": "https://example.com", "title": "Example"}]
+        )
+    """
+    # Build base payload
+    payload: dict[str, Any] = {
+        "id": id,
+        "datasetName": dataset_name,
+        "status": status.value if isinstance(status, GroundTruthStatus) else status,
+        "manualTags": manual_tags or [],
+        "comment": comment,
+        **kwargs,
+    }
+
+    if reviewed_at is not None:
+        payload["reviewedAt"] = (
+            reviewed_at.isoformat() if isinstance(reviewed_at, datetime) else reviewed_at
+        )
+
+    # Handle history construction
+    if history is not None:
+        # Use explicit history
+        payload["history"] = history
+    elif synth_question or answer:
+        # Auto-generate simple Q&A history from legacy-style params
+        auto_history: list[dict[str, Any]] = []
+        if synth_question:
+            auto_history.append({"role": "user", "msg": synth_question})
+        if answer:
+            auto_history.append({"role": "assistant", "msg": answer})
+        payload["history"] = auto_history
+
+    # Build rag-compat plugin data if any legacy fields are provided
+    rag_compat_data: dict[str, Any] = {}
+    if synth_question is not None:
+        rag_compat_data["synthQuestion"] = synth_question
+    if edited_question is not None:
+        rag_compat_data["editedQuestion"] = edited_question
+    if answer is not None:
+        rag_compat_data["answer"] = answer
+    if refs is not None:
+        rag_compat_data["refs"] = refs
+
+    if rag_compat_data:
+        payload["plugins"] = {
+            "rag-compat": {
+                "kind": "rag-compat",
+                "version": "1.0",
+                "data": rag_compat_data,
+            }
+        }
+
+    # Validate and return
+    return AgenticGroundTruthEntry.model_validate(payload)
+
+
+def make_simple_qa_entry(
+    question: str,
+    answer: str,
+    *,
+    id: str = "test-item",
+    dataset_name: str = "test-dataset",
+    refs: list[dict[str, Any]] | None = None,
+    **kwargs: Any,
+) -> AgenticGroundTruthEntry:
+    """Create a simple Q&A entry (convenience wrapper).
+
+    Args:
+        question: The question text
+        answer: The answer text
+        id: Item ID
+        dataset_name: Dataset name
+        refs: Optional references
+        **kwargs: Additional fields
+
+    Returns:
+        AgenticGroundTruthEntry: A test entry with Q&A history
+    """
+    return make_test_entry(
+        id=id,
+        dataset_name=dataset_name,
+        synth_question=question,
+        answer=answer,
+        refs=refs,
+        **kwargs,
+    )
+
+
+def make_history_entry(
+    role: str,
+    msg: str,
+    refs: list[dict[str, Any]] | None = None,
+    expected_behavior: list[str] | None = None,
+) -> dict[str, Any]:
+    """Create a single history entry dict for use in history arrays.
+
+    Args:
+        role: Role (e.g., "user", "assistant")
+        msg: Message content
+        refs: Optional references for this turn
+        expected_behavior: Optional expected behavior annotations
+
+    Returns:
+        dict: A history entry dict ready for inclusion in entry.history
+    """
+    entry: dict[str, Any] = {"role": role, "msg": msg}
+    if refs is not None:
+        entry["refs"] = refs
+    if expected_behavior is not None:
+        entry["expectedBehavior"] = expected_behavior
+    return entry
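Reviewer note on `backend/tests/test_helpers.py`: the payload-building steps of `make_test_entry` can be exercised without the domain model. The sketch below is not part of this change. `build_entry_payload` is a hypothetical stand-in that copies only the history and `rag-compat` logic from the helper above and skips `AgenticGroundTruthEntry.model_validate`, so it shows the dict that would be handed to validation, not the validated entry.

```python
from typing import Any, Optional


def build_entry_payload(
    *,
    id: str = "test-item",
    history: Optional[list[dict[str, Any]]] = None,
    synth_question: Optional[str] = None,
    answer: Optional[str] = None,
    refs: Optional[list[dict[str, Any]]] = None,
) -> dict[str, Any]:
    """Hypothetical mirror of the dict built inside make_test_entry (no validation)."""
    payload: dict[str, Any] = {"id": id}

    if history is not None:
        # An explicit history always wins over the legacy-style params.
        payload["history"] = history
    elif synth_question or answer:
        # Otherwise derive a simple Q&A history from question/answer.
        auto_history: list[dict[str, Any]] = []
        if synth_question:
            auto_history.append({"role": "user", "msg": synth_question})
        if answer:
            auto_history.append({"role": "assistant", "msg": answer})
        payload["history"] = auto_history

    # Legacy fields are kept in the rag-compat plugin envelope,
    # independently of whether history was explicit or derived.
    legacy = {"synthQuestion": synth_question, "answer": answer, "refs": refs}
    rag_compat_data = {key: value for key, value in legacy.items() if value is not None}
    if rag_compat_data:
        payload["plugins"] = {
            "rag-compat": {"kind": "rag-compat", "version": "1.0", "data": rag_compat_data}
        }
    return payload


qa = build_entry_payload(synth_question="What is X?", answer="X is Y")
explicit = build_entry_payload(
    history=[{"role": "user", "msg": "Hello"}],
    answer="kept in plugin only",
)
empty = build_entry_payload()
```

Under these assumptions, `qa` gets a two-turn history plus a `rag-compat` envelope, `explicit` keeps its one-turn history while `answer` lands only in the plugin data, and `empty` carries neither key.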
diff --git a/backend/tests/unit/conftest.py b/backend/tests/unit/conftest.py
index e4313cd..1c8d123 100644
--- a/backend/tests/unit/conftest.py
+++ b/backend/tests/unit/conftest.py
@@ -12,7 +12,6 @@
 from app.services.search_service import SearchService
 from app.services.curation_service import CurationService
 from app.services.tag_registry_service import TagRegistryService
-from app.services.chat_service import ChatService
 
 
 # Use pytest-asyncio "auto mode"/anyio; legacy markers in tests use anyio/anyio_backend
@@ -33,8 +32,6 @@ def configure_unit_test_settings():
     # Save original values
     orig_ezauth = settings.EZAUTH_ENABLED
     orig_auth_mode = settings.AUTH_MODE
-    orig_chat_enabled = settings.CHAT_ENABLED
-    orig_store_steps = settings.STORE_AGENT_STEPS
     orig_anon_paths = settings.EZAUTH_ALLOW_ANONYMOUS_PATHS
     orig_allowed_domains = settings.EZAUTH_ALLOWED_EMAIL_DOMAINS
     orig_allowed_object_ids = settings.EZAUTH_ALLOWED_OBJECT_IDS
@@ -43,8 +40,6 @@ def configure_unit_test_settings():
     settings.EZAUTH_ENABLED = False
     settings.AUTH_MODE = "dev"
     settings.EZAUTH_ALLOW_ANONYMOUS_PATHS = "/healthz"
-    settings.CHAT_ENABLED = True
-    settings.STORE_AGENT_STEPS = False
     settings.EZAUTH_ALLOWED_EMAIL_DOMAINS = None
     settings.EZAUTH_ALLOWED_OBJECT_IDS = None
 
@@ -53,8 +48,6 @@ def configure_unit_test_settings():
     # Restore on session teardown
     settings.EZAUTH_ENABLED = orig_ezauth
     settings.AUTH_MODE = orig_auth_mode
-    settings.CHAT_ENABLED = orig_chat_enabled
-    settings.STORE_AGENT_STEPS = orig_store_steps
     settings.EZAUTH_ALLOW_ANONYMOUS_PATHS = orig_anon_paths
     settings.EZAUTH_ALLOWED_EMAIL_DOMAINS = orig_allowed_domains
     settings.EZAUTH_ALLOWED_OBJECT_IDS = orig_allowed_object_ids
@@ -180,16 +173,11 @@ async def upsert_remove(self, tags_to_remove):
         processor_registry=container.export_processor_registry,
         formatter_registry=container.export_formatter_registry,
         default_processor_order=container.export_default_processor_order,
+        plugin_export_transforms=container.plugin_pack_registry.collect_export_transforms(),
     )
     container.search_service = SearchService()
     container.curation_service = CurationService(container.repo)
     container.tag_registry_service = TagRegistryService(_InMemoryTagsRepo())
-    container.inference_service = None  # No real agent in unit tests
-    container.chat_service = ChatService(
-        inference_service=None,
-        steps_store=None,
-        store_steps=False,
-    )
     # Import LifespanManager lazily so tests can still run without the
     # optional dev dependency installed. If missing, we yield the app and
     # rely on FastAPI's lifespan being a no-op (it will still run, but
@@ -252,26 +240,6 @@ def clear_db():
     yield
 
 
-@pytest.fixture(autouse=True)
-def reset_chat_state():
-    orig_service = container.chat_service
-    orig_inference = container.inference_service
-    orig_store = container.agent_steps_store
-    chat_enabled = settings.CHAT_ENABLED
-    store_steps = settings.STORE_AGENT_STEPS
-    yield
-    container.chat_service = orig_service
-    container.inference_service = orig_inference
-    container.agent_steps_store = orig_store
-    settings.CHAT_ENABLED = chat_enabled
-    settings.STORE_AGENT_STEPS = store_steps
-    try:
-        container.chat_service.set_store_steps(settings.STORE_AGENT_STEPS)
-        container.chat_service.set_steps_store(container.agent_steps_store)
-    except Exception:
-        pass
-
-
 # Optional: use uvloop for faster event loop if installed. This is safe to
 # execute — if uvloop is missing we silently skip it.
 try:
diff --git a/backend/tests/unit/plugins/test_plugin_dataset.py b/backend/tests/unit/plugins/test_plugin_dataset.py
index 5cfb05f..b07a0c1 100644
--- a/backend/tests/unit/plugins/test_plugin_dataset.py
+++ b/backend/tests/unit/plugins/test_plugin_dataset.py
@@ -4,7 +4,7 @@
 
 import pytest
 
-from app.domain.models import GroundTruthItem
+from app.domain.models import AgenticGroundTruthEntry
 from app.plugins.computed_tags.dataset import DatasetPlugin
 
 
@@ -30,7 +30,7 @@ def test_tag_key_is_dynamic_placeholder(self):
     def test_compute_returns_dataset_prefixed_tag(self, dataset_name, expected_tag):
         """compute() returns 'dataset:' prefix with the dataset name."""
         plugin = DatasetPlugin()
-        item = GroundTruthItem(
+        item = AgenticGroundTruthEntry(
             id="test-id",
             datasetName=dataset_name,
             synthQuestion="Question",
diff --git a/backend/tests/unit/plugins/test_plugin_no_answer.py b/backend/tests/unit/plugins/test_plugin_no_answer.py
index 31d458a..3b561db 100644
--- a/backend/tests/unit/plugins/test_plugin_no_answer.py
+++ b/backend/tests/unit/plugins/test_plugin_no_answer.py
@@ -2,7 +2,7 @@
 
 from __future__ import annotations
 
-from app.domain.models import GroundTruthItem
+from app.domain.models import AgenticGroundTruthEntry
 from app.plugins.computed_tags.no_answer import NoAnswerPlugin
 
 
@@ -12,13 +12,15 @@ class TestNoAnswerPlugin:
     def test_no_answer_exact_match(self):
         """Should return tag when answer is exactly NO_ANSWER."""
         plugin = NoAnswerPlugin()
-        item = GroundTruthItem(id="test", datasetName="test", synthQuestion="Q", answer="NO_ANSWER")
+        item = AgenticGroundTruthEntry(
+            id="test", datasetName="test", synthQuestion="Q", answer="NO_ANSWER"
+        )
         assert plugin.compute(item) == "answer:no_answer"
 
     def test_no_answer_with_whitespace(self):
         """Should return tag when answer is NO_ANSWER with surrounding whitespace."""
         plugin = NoAnswerPlugin()
-        item = GroundTruthItem(
+        item = AgenticGroundTruthEntry(
             id="test", datasetName="test", synthQuestion="Q", answer="  NO_ANSWER  "
         )
         assert plugin.compute(item) == "answer:no_answer"
@@ -26,7 +28,7 @@ def test_no_answer_with_whitespace(self):
     def test_no_answer_with_newlines(self):
         """Should return tag when answer is NO_ANSWER with newlines."""
         plugin = NoAnswerPlugin()
-        item = GroundTruthItem(
+        item = AgenticGroundTruthEntry(
             id="test", datasetName="test", synthQuestion="Q", answer="\nNO_ANSWER\n"
         )
         assert plugin.compute(item) == "answer:no_answer"
@@ -34,7 +36,7 @@ def test_no_answer_with_newlines(self):
     def test_regular_answer_returns_none(self):
         """Should return None for regular answers."""
         plugin = NoAnswerPlugin()
-        item = GroundTruthItem(
+        item = AgenticGroundTruthEntry(
             id="test", datasetName="test", synthQuestion="Q", answer="A valid answer"
         )
         assert plugin.compute(item) is None
@@ -42,11 +44,13 @@ def test_regular_answer_returns_none(self):
     def test_none_answer_returns_none(self):
         """Should return None when answer is None."""
         plugin = NoAnswerPlugin()
-        item = GroundTruthItem(id="test", datasetName="test", synthQuestion="Q", answer=None)
+        item = AgenticGroundTruthEntry(
+            id="test", datasetName="test", synthQuestion="Q", answer=None
+        )
         assert plugin.compute(item) is None
 
     def test_empty_answer_returns_none(self):
         """Should return None when answer is empty string."""
         plugin = NoAnswerPlugin()
-        item = GroundTruthItem(id="test", datasetName="test", synthQuestion="Q", answer="")
+        item = AgenticGroundTruthEntry(id="test", datasetName="test", synthQuestion="Q", answer="")
         assert plugin.compute(item) is None
diff --git a/backend/tests/unit/plugins/test_plugin_question_length.py b/backend/tests/unit/plugins/test_plugin_question_length.py
index 32eb253..d740fb9 100644
--- a/backend/tests/unit/plugins/test_plugin_question_length.py
+++ b/backend/tests/unit/plugins/test_plugin_question_length.py
@@ -4,7 +4,7 @@
 
 import pytest
 
-from app.domain.models import GroundTruthItem
+from app.domain.models import AgenticGroundTruthEntry
 from app.plugins.computed_tags.question_length import (
     QuestionLengthLongPlugin,
     QuestionLengthMediumPlugin,
@@ -31,7 +31,7 @@ def test_mutually_exclusive_classification(
     ):
         """Each document gets exactly one length tag."""
         question = " ".join([f"word{i}" for i in range(word_count)])
-        item = GroundTruthItem(
+        item = AgenticGroundTruthEntry(
             id="test-id",
             datasetName="test-dataset",
             synthQuestion=question,
@@ -55,7 +55,7 @@ def test_mutually_exclusive_classification(
 
     def test_edited_question_takes_precedence(self):
         """editedQuestion is used over synthQuestion when present."""
-        item = GroundTruthItem(
+        item = AgenticGroundTruthEntry(
             id="test-id",
             datasetName="test-dataset",
             synthQuestion="short",  # 1 word
diff --git a/backend/tests/unit/plugins/test_plugin_reference_type.py b/backend/tests/unit/plugins/test_plugin_reference_type.py
index 540d3eb..8d7dfd8 100644
--- a/backend/tests/unit/plugins/test_plugin_reference_type.py
+++ b/backend/tests/unit/plugins/test_plugin_reference_type.py
@@ -4,7 +4,7 @@
 
 import pytest
 
-from app.domain.models import GroundTruthItem, Reference
+from app.domain.models import AgenticGroundTruthEntry, Reference
 from app.plugins.computed_tags.reference_type import (
     ReferenceTypeArticlePlugin,
     ReferenceTypeHelpcenterPlugin,
@@ -48,7 +48,7 @@ class TestReferenceTypePlugins:
 
     def test_no_refs_gets_no_tags(self):
         """Item with no refs should get neither tag."""
-        item = GroundTruthItem(
+        item = AgenticGroundTruthEntry(
             id="test-no-refs",
             datasetName="test-dataset",
             synthQuestion="Question",
@@ -58,7 +58,7 @@ def test_no_refs_gets_no_tags(self):
 
     def test_item_can_have_both_tags(self):
         """Item with both reference types should get both tags."""
-        item = GroundTruthItem(
+        item = AgenticGroundTruthEntry(
             id="test-both",
             datasetName="test-dataset",
             synthQuestion="Question",
@@ -72,7 +72,7 @@ def test_item_can_have_both_tags(self):
 
     def test_type_field_is_ignored(self):
         """Only URL matters, not the type field on Reference."""
-        item = GroundTruthItem(
+        item = AgenticGroundTruthEntry(
             id="test-type-ignored",
             datasetName="test-dataset",
             synthQuestion="Question",
diff --git a/backend/tests/unit/plugins/test_plugin_retrieval_behavior.py b/backend/tests/unit/plugins/test_plugin_retrieval_behavior.py
index 6b57167..3439b15 100644
--- a/backend/tests/unit/plugins/test_plugin_retrieval_behavior.py
+++ b/backend/tests/unit/plugins/test_plugin_retrieval_behavior.py
@@ -4,7 +4,7 @@
 
 import pytest
 
-from app.domain.models import GroundTruthItem, Reference, HistoryItem
+from app.domain.models import AgenticGroundTruthEntry, Reference, HistoryItem
 from app.domain.enums import HistoryItemRole
 from app.plugins.computed_tags.retrieval_behavior import (
     RetrievalBehaviorNoRefsPlugin,
@@ -30,7 +30,7 @@ class TestRetrievalBehaviorPlugins:
     )
     def test_mutually_exclusive_classification(self, num_refs, expected_tag):
         """Each document gets exactly one retrieval behavior tag."""
-        item = GroundTruthItem(
+        item = AgenticGroundTruthEntry(
             id=f"test-{num_refs}-refs",
             datasetName="test-dataset",
             synthQuestion="Question",
@@ -52,7 +52,7 @@ def test_mutually_exclusive_classification(self, num_refs, expected_tag):
 
     def test_refs_in_history_are_counted(self):
         """References in history turns are included in the count."""
-        item = GroundTruthItem(
+        item = AgenticGroundTruthEntry(
             id="test-history-refs",
             datasetName="test-dataset",
             synthQuestion="Follow up question",
diff --git a/backend/tests/unit/plugins/test_plugin_turns.py b/backend/tests/unit/plugins/test_plugin_turns.py
index 98b9fdc..c09d372 100644
--- a/backend/tests/unit/plugins/test_plugin_turns.py
+++ b/backend/tests/unit/plugins/test_plugin_turns.py
@@ -4,7 +4,7 @@
 
 import pytest
 
-from app.domain.models import GroundTruthItem, HistoryItem
+from app.domain.models import AgenticGroundTruthEntry, HistoryItem
 from app.domain.enums import HistoryItemRole
 from app.plugins.computed_tags.turns import MultiTurnPlugin, SingleTurnPlugin
 
@@ -36,7 +36,7 @@ def test_mutually_exclusive_classification(self, history_len, expected_single, e
             else None
         )
 
-        item = GroundTruthItem(
+        item = AgenticGroundTruthEntry(
             id="test-id",
             datasetName="test-dataset",
             synthQuestion="Question",
diff --git a/backend/tests/unit/test_assignments_skip_persist.py b/backend/tests/unit/test_assignments_skip_persist.py
index cf4feec..3294aa4 100644
--- a/backend/tests/unit/test_assignments_skip_persist.py
+++ b/backend/tests/unit/test_assignments_skip_persist.py
@@ -5,18 +5,18 @@
 from datetime import datetime, timezone
 
 from app.container import container
-from app.domain.models import GroundTruthItem
+from app.domain.models import AgenticGroundTruthEntry
 from app.domain.enums import GroundTruthStatus
 
 
 class _InMemoryRepo:
     def __init__(self):
         # Tuple key: datasetName, bucket_str, id
-        self.items: dict[tuple[str, str, str], GroundTruthItem] = {}
+        self.items: dict[tuple[str, str, str], AgenticGroundTruthEntry] = {}
 
     # ---- GroundTruthRepo protocol (minimal working set for this test) ----
     async def import_bulk_gt(
-        self, items: list[GroundTruthItem], buckets: int | None = None
+        self, items: list[AgenticGroundTruthEntry], buckets: int | None = None
     ):  # pragma: no cover
         raise NotImplementedError
 
@@ -30,7 +30,7 @@ async def get_gt(self, dataset: str, bucket: UUID, item_id: str):  # type: ignor
         key = (dataset, str(bucket), item_id)
         return self.items.get(key)
 
-    async def upsert_gt(self, item: GroundTruthItem):  # type: ignore[override]
+    async def upsert_gt(self, item: AgenticGroundTruthEntry):  # type: ignore[override]
         # Simulate ETag update on write
         item.etag = (item.etag or "etag") + ":updated"
         key = (item.datasetName, str(item.bucket), item.id)
@@ -51,17 +51,17 @@ async def list_unassigned(self, limit: int):  # pragma: no cover
 
     async def query_unassigned_by_dataset_prefix(
         self, dataset_prefix: str, user_id: str, take: int, exclude_ids: list[str] | None = None
-    ) -> list[GroundTruthItem]:  # pragma: no cover
+    ) -> list[AgenticGroundTruthEntry]:  # pragma: no cover
         raise NotImplementedError
 
     async def query_unassigned_global(
         self, user_id: str, take: int, exclude_ids: list[str] | None = None
-    ) -> list[GroundTruthItem]:  # pragma: no cover
+    ) -> list[AgenticGroundTruthEntry]:  # pragma: no cover
         raise NotImplementedError
 
     async def sample_unassigned(
         self, user_id: str, limit: int, exclude_ids: list[str] | None = None
-    ) -> list[GroundTruthItem]:  # pragma: no cover
+    ) -> list[AgenticGroundTruthEntry]:  # pragma: no cover
         raise NotImplementedError
 
     async def list_gt_paginated(
@@ -87,7 +87,9 @@ async def assign_to(self, item_id: str, user_id: str):  # pragma: no cover
     async def list_assigned(self, user_id: str):  # pragma: no cover
         raise NotImplementedError
 
-    async def upsert_assignment_doc(self, user_id: str, gt: GroundTruthItem):  # pragma: no cover
+    async def upsert_assignment_doc(
+        self, user_id: str, gt: AgenticGroundTruthEntry
+    ):  # pragma: no cover
         raise NotImplementedError
 
     async def list_assignments_by_user(self, user_id: str):  # pragma: no cover
@@ -123,7 +125,7 @@ async def test_status_skipped_keeps_assignment(async_client, user_headers):
     repo = _InMemoryRepo()
     container.repo = repo
     try:
-        gt = GroundTruthItem(
+        gt = AgenticGroundTruthEntry(
             id=item_id,
             datasetName=dataset,
             bucket=bucket,
diff --git a/backend/tests/unit/test_bulk_import_tag_validation.py b/backend/tests/unit/test_bulk_import_tag_validation.py
index d0957df..a716976 100644
--- a/backend/tests/unit/test_bulk_import_tag_validation.py
+++ b/backend/tests/unit/test_bulk_import_tag_validation.py
@@ -1,6 +1,6 @@
 import pytest
 from unittest.mock import AsyncMock, MagicMock, patch
-from app.domain.models import GroundTruthItem, BulkImportResult
+from app.domain.models import AgenticGroundTruthEntry, BulkImportResult
 from app.core.auth import UserContext
 
 
@@ -31,7 +31,7 @@ async def test_bulk_import_validates_tags(mock_container, mock_user):
     )
 
     items = [
-        GroundTruthItem(
+        AgenticGroundTruthEntry(
             id="test-1",
             datasetName="test",
             synthQuestion="What is Q?",
@@ -65,7 +65,7 @@ async def test_bulk_import_rejects_invalid_tags(mock_container, mock_user):
     )
 
     items = [
-        GroundTruthItem(
+        AgenticGroundTruthEntry(
             id="test-1", datasetName="test", synthQuestion="What is Q?", manualTags=["invalid:tag"]
         )
     ]
@@ -102,13 +102,13 @@ async def test_bulk_import_mixed_valid_invalid_tags(mock_container, mock_user):
     )
 
     items = [
-        GroundTruthItem(
+        AgenticGroundTruthEntry(
             id="test-1",
             datasetName="test",
             synthQuestion="Q1?",
             manualTags=["source:synthetic"],  # valid
         ),
-        GroundTruthItem(
+        AgenticGroundTruthEntry(
             id="test-2",
             datasetName="test",
             synthQuestion="Q2?",
@@ -149,7 +149,7 @@ async def test_bulk_import_no_tags(mock_container, mock_user):
     )
 
     items = [
-        GroundTruthItem(
+        AgenticGroundTruthEntry(
             id="test-1",
             datasetName="test",
             synthQuestion="What is Q?",
@@ -183,7 +183,7 @@ async def test_bulk_import_tag_validation_single_registry_fetch(mock_container,
     )
 
     items = [
-        GroundTruthItem(
+        AgenticGroundTruthEntry(
             id=f"test-{i}",
             datasetName="test",
             synthQuestion=f"Q{i}?",
diff --git a/backend/tests/unit/test_chat_endpoint.py b/backend/tests/unit/test_chat_endpoint.py
deleted file mode 100644
index e8428ba..0000000
--- a/backend/tests/unit/test_chat_endpoint.py
+++ /dev/null
@@ -1,53 +0,0 @@
-from __future__ import annotations
-
-import pytest
-
-
-@pytest.mark.anyio
-async def test_chat_rejects_empty_message(async_client, user_headers):
-    res = await async_client.post(
-        "/v1/chat",
-        json={"message": "   "},
-        headers=user_headers,
-    )
-    assert res.status_code == 422
-
-
-@pytest.mark.anyio
-async def test_chat_rejects_suspicious_content(async_client, user_headers):
-    res = await async_client.post(
-        "/v1/chat",
-        json={"message": "<script>alert(1)</script>"},
-        headers=user_headers,
-    )
-    assert res.status_code == 422
-
-
-@pytest.mark.anyio
-async def test_chat_returns_expected_fields(async_client, user_headers):
-    # reset_chat_state fixture ensures CHAT_ENABLED is True by default
-    # and inference_service/chat_service are properly initialized
-    res = await async_client.post(
-        "/v1/chat",
-        json={"message": "Tell me something"},
-        headers=user_headers,
-    )
-    assert res.status_code == 200
-    body = res.json()
-    assert "content" in body
-    assert isinstance(body.get("references"), list)
-
-
-@pytest.mark.anyio
-async def test_chat_returns_503_when_disabled(async_client, user_headers, monkeypatch):
-    # Use monkeypatch to avoid polluting global state
-    from app.core.config import settings
-
-    monkeypatch.setattr(settings, "CHAT_ENABLED", False)
-
-    res = await async_client.post(
-        "/v1/chat",
-        json={"message": "anything"},
-        headers=user_headers,
-    )
-    assert res.status_code == 503
diff --git a/backend/tests/unit/test_chat_service.py b/backend/tests/unit/test_chat_service.py
deleted file mode 100644
index 25c03f5..0000000
--- a/backend/tests/unit/test_chat_service.py
+++ /dev/null
@@ -1,63 +0,0 @@
-from __future__ import annotations
-
-import pytest
-
-from app.services.chat_service import ChatService
-from app.adapters.agent_steps_store import AgentStepsStore
-
-
-class FakeInferenceService:
-    """Fake InferenceService that returns canned responses for testing."""
-
-    def __init__(self, response: dict) -> None:
-        self.response = response
-        self.calls: list[tuple[str, str]] = []
-
-    def generate(self, *, user_id: str, message: str) -> dict:
-        """Synchronous generate matching InferenceService interface."""
-        self.calls.append((user_id, message))
-        return self.response
-
-
-class RecordingStore(AgentStepsStore):
-    def __init__(self) -> None:
-        self.saved: list[dict[str, object]] = []
-
-    async def save(self, *, user_id: str, request: dict, response: dict) -> None:  # type: ignore[override]
-        self.saved.append({"user_id": user_id, "request": request, "response": response})
-
-
-@pytest.mark.anyio
-async def test_generate_response_mock_fallback():
-    """When inference_service is None, ChatService returns mock response."""
-    service = ChatService(inference_service=None)
-    result = await service.generate_response(user_id="u", message="hi", context=None)
-    assert "content" in result
-    assert isinstance(result["references"], list)
-
-
-@pytest.mark.anyio
-async def test_generate_response_passes_through_inference_response():
-    """ChatService passes through response from InferenceService."""
-    fake_response = {"content": "hello", "references": [{"id": "r1"}]}
-    inference = FakeInferenceService(fake_response)
-    service = ChatService(inference_service=inference)  # type: ignore[arg-type]
-
-    result = await service.generate_response(user_id="user", message="msg", context="ctx")
-    assert result == fake_response
-    assert inference.calls == [("user", "msg")]
-
-
-@pytest.mark.anyio
-async def test_generate_response_persists_steps_when_enabled():
-    """ChatService persists steps when store_steps is enabled."""
-    fake_response = {"content": "hello", "references": []}
-    inference = FakeInferenceService(fake_response)
-    store = RecordingStore()
-    service = ChatService(inference_service=inference, steps_store=store, store_steps=True)  # type: ignore[arg-type]
-
-    await service.generate_response(user_id="user", message="msg", context=None)
-    assert store.saved
-    saved = store.saved[0]
-    assert saved["user_id"] == "user"
-    assert saved["response"] == fake_response
diff --git a/backend/tests/unit/test_computed_tags_plugins.py b/backend/tests/unit/test_computed_tags_plugins.py
index 2617b46..630dcb0 100644
--- a/backend/tests/unit/test_computed_tags_plugins.py
+++ b/backend/tests/unit/test_computed_tags_plugins.py
@@ -4,7 +4,7 @@
 
 import pytest
 
-from app.domain.models import GroundTruthItem
+from app.domain.models import AgenticGroundTruthEntry
 from app.plugins.base import ComputedTagPlugin, TagPluginRegistry
 from app.plugins.registry import (
     get_default_registry,
@@ -26,7 +26,7 @@ class TestTagPluginRegistry:
     def test_empty_registry_returns_empty_tags(self):
         """An empty registry should return no tags."""
         registry = TagPluginRegistry()
-        item = GroundTruthItem(id="test", datasetName="test", synthQuestion="Q")
+        item = AgenticGroundTruthEntry(id="test", datasetName="test", synthQuestion="Q")
         assert registry.compute_all(item) == []
         assert registry.get_all_keys() == set()
 
@@ -38,7 +38,7 @@ class Plugin1(ComputedTagPlugin):
             def tag_key(self) -> str:
                 return "dup:key"
 
-            def compute(self, doc: GroundTruthItem) -> str | None:
+            def compute(self, doc: AgenticGroundTruthEntry) -> str | None:
                 return self.tag_key
 
         class Plugin2(ComputedTagPlugin):
@@ -46,7 +46,7 @@ class Plugin2(ComputedTagPlugin):
             def tag_key(self) -> str:
                 return "dup:key"
 
-            def compute(self, doc: GroundTruthItem) -> str | None:
+            def compute(self, doc: AgenticGroundTruthEntry) -> str | None:
                 return self.tag_key
 
         registry = TagPluginRegistry()
@@ -80,7 +80,7 @@ class DynamicPlugin(ComputedTagPlugin):
             def tag_key(self) -> str:
                 return "dataset:_dynamic"
 
-            def compute(self, doc: GroundTruthItem) -> str | None:
+            def compute(self, doc: AgenticGroundTruthEntry) -> str | None:
                 return f"dataset:{doc.datasetName}" if doc.datasetName else None
 
         registry = TagPluginRegistry()
@@ -96,7 +96,7 @@ class StaticPlugin(ComputedTagPlugin):
             def tag_key(self) -> str:
                 return "turns:multiturn"
 
-            def compute(self, doc: GroundTruthItem) -> str | None:
+            def compute(self, doc: AgenticGroundTruthEntry) -> str | None:
                 return self.tag_key
 
         class DynamicPlugin(ComputedTagPlugin):
@@ -104,7 +104,7 @@ class DynamicPlugin(ComputedTagPlugin):
             def tag_key(self) -> str:
                 return "dataset:_dynamic"
 
-            def compute(self, doc: GroundTruthItem) -> str | None:
+            def compute(self, doc: AgenticGroundTruthEntry) -> str | None:
                 return f"dataset:{doc.datasetName}" if doc.datasetName else None
 
         registry = TagPluginRegistry()
@@ -122,7 +122,7 @@ class StaticPlugin(ComputedTagPlugin):
             def tag_key(self) -> str:
                 return "turns:multiturn"
 
-            def compute(self, doc: GroundTruthItem) -> str | None:
+            def compute(self, doc: AgenticGroundTruthEntry) -> str | None:
                 return self.tag_key
 
         registry = TagPluginRegistry()
@@ -139,7 +139,7 @@ class DynamicPlugin(ComputedTagPlugin):
             def tag_key(self) -> str:
                 return "dataset:_dynamic"
 
-            def compute(self, doc: GroundTruthItem) -> str | None:
+            def compute(self, doc: AgenticGroundTruthEntry) -> str | None:
                 return f"dataset:{doc.datasetName}" if doc.datasetName else None
 
         registry = TagPluginRegistry()
@@ -160,7 +160,7 @@ class StaticPlugin(ComputedTagPlugin):
             def tag_key(self) -> str:
                 return "turns:multiturn"
 
-            def compute(self, doc: GroundTruthItem) -> str | None:
+            def compute(self, doc: AgenticGroundTruthEntry) -> str | None:
                 return self.tag_key
 
         registry = TagPluginRegistry()
@@ -183,7 +183,7 @@ class DynamicPlugin(ComputedTagPlugin):
             def tag_key(self) -> str:
                 return "dataset:_dynamic"
 
-            def compute(self, doc: GroundTruthItem) -> str | None:
+            def compute(self, doc: AgenticGroundTruthEntry) -> str | None:
                 return f"dataset:{doc.datasetName}" if doc.datasetName else None
 
         registry = TagPluginRegistry()
@@ -204,7 +204,7 @@ class TestGroundTruthItemTagMerge:
 
     def test_computed_and_manual_tags_merge(self):
         """Verify computed and manual tags merge correctly and are sorted."""
-        item = GroundTruthItem(
+        item = AgenticGroundTruthEntry(
             id="merge-test",
             datasetName="test-dataset",
             synthQuestion="Test question",
diff --git a/backend/tests/unit/test_cosmos_repo.py b/backend/tests/unit/test_cosmos_repo.py
index c67bdd9..0647bf7 100644
--- a/backend/tests/unit/test_cosmos_repo.py
+++ b/backend/tests/unit/test_cosmos_repo.py
@@ -1,12 +1,14 @@
 from __future__ import annotations
 
 from datetime import datetime, timezone
+from unittest.mock import AsyncMock, MagicMock, patch
 
 import pytest  # type: ignore[import-not-found]
 
-from app.adapters.repos.cosmos_repo import CosmosGroundTruthRepo
+from app.adapters.repos.cosmos_repo import CosmosGroundTruthRepo, SELECT_CLAUSE_C
 from app.domain.enums import GroundTruthStatus, SortField, SortOrder
-from app.domain.models import GroundTruthItem
+from app.domain.models import AgenticGroundTruthEntry
+from tests.test_helpers import make_test_entry
 
 
 @pytest.fixture()
@@ -103,49 +105,72 @@ def test_resolve_sort_with_overrides(repo: CosmosGroundTruthRepo) -> None:
 
 
 def test_sort_key_has_answer(repo: CosmosGroundTruthRepo) -> None:
-    example = GroundTruthItem.model_validate(
-        {
-            "id": "item",
-            "datasetName": "faq",
-            "synthQuestion": "What?",
-            "answer": "value",
-            "manualTags": ["team:sme"],
-            "reviewedAt": datetime(2024, 1, 1, tzinfo=timezone.utc).isoformat(),
-        }
+    example = make_test_entry(
+        id="item",
+        dataset_name="faq",
+        synth_question="What?",
+        answer="value",
+        manual_tags=["team:sme"],
+        reviewed_at=datetime(2024, 1, 1, tzinfo=timezone.utc),
     )
     key = CosmosGroundTruthRepo._sort_key(example, SortField.has_answer)
     assert key[0] == 1
 
 
+def test_select_clause_includes_generic_phase_one_fields() -> None:
+    for field in (
+        "c.scenarioId",
+        "c.contextEntries",
+        "c.traceIds",
+        "c.toolCalls",
+        "c.expectedTools",
+        "c.feedback",
+        "c.metadata",
+        "c.createdBy",
+        "c.createdAt",
+        "c.tracePayload",
+    ):
+        assert field in SELECT_CLAUSE_C
+
+
 # =============================================================================
 # Tests for totalReferences auto-computation (domain model validator)
 # =============================================================================
 
 
 class TestComputeTotalReferences:
-    """Unit tests for GroundTruthItem.compute_total_references_if_needed.
+    """Unit tests for AgenticGroundTruthEntry.totalReferences computation.
 
-    The method calculates total references with the following logic:
+    The property calculates total references with the following logic:
     - If history has refs, count only history refs (history takes priority)
-    - If history has no refs, count item-level refs as fallback
+    - If history has no refs, count plugin-stored refs as fallback
+
+    **Phase 5 Audit (2026-03-12)**: ACTIVE COMPUTATION LOGIC - BLOCKING
+    The totalReferences field has active property logic that computes
+    values from history and plugin refs. This is not just compatibility
+    testing - it's core functionality that is used by:
+    - Model validation on all item saves
+    - Sort/filter operations that check reference counts
+    - UI displays of reference totals
+
+    totalReferences cannot be deleted until this computation is either:
+    - Moved to a computed property on AgenticGroundTruthEntry, OR
+    - Replaced by direct history ref counting in callers
     """
 
     def _make_item(
         self,
         refs: list[dict] | None = None,
         history: list[dict] | None = None,
-    ) -> GroundTruthItem:
-        """Helper to create a GroundTruthItem with specified refs and history."""
-        data: dict = {
-            "id": "test-item",
-            "datasetName": "test-dataset",
-            "synthQuestion": "Test question?",
-        }
-        if refs is not None:
-            data["refs"] = refs
-        if history is not None:
-            data["history"] = history
-        return GroundTruthItem.model_validate(data)
+    ) -> AgenticGroundTruthEntry:
+        """Helper to create an AgenticGroundTruthEntry with specified refs and history."""
+        return make_test_entry(
+            id="test-item",
+            dataset_name="test-dataset",
+            synth_question="Test question?",
+            refs=refs,
+            history=history,
+        )
 
     # -------------------------------------------------------------------------
     # History refs take priority over item refs
@@ -317,13 +342,23 @@ def test_many_refs_in_single_turn(self) -> None:
 
     def test_item_only_no_history_field_at_all(self) -> None:
         """Item created without history field entirely."""
-        data = {
-            "id": "minimal-item",
-            "datasetName": "test",
-            "synthQuestion": "What?",
-            "refs": [{"url": "https://only-ref.com"}],
-        }
-        item = GroundTruthItem.model_validate(data)
+        # Use model_validate directly to test the case where history is completely absent
+        item = AgenticGroundTruthEntry.model_validate(
+            {
+                "id": "minimal-item",
+                "datasetName": "test",
+                "plugins": {
+                    "rag-compat": {
+                        "kind": "rag-compat",
+                        "version": "1.0",
+                        "data": {
+                            "synthQuestion": "What?",
+                            "refs": [{"url": "https://only-ref.com"}],
+                        },
+                    }
+                },
+            }
+        )
         assert item.totalReferences == 1
 
     def test_complex_real_world_scenario(self) -> None:
@@ -359,3 +394,94 @@ def test_complex_real_world_scenario(self) -> None:
         )
         # History refs: 2 + 1 = 3 (item-level ref is ignored)
         assert item.totalReferences == 3
+
+
+# ---------------------------------------------------------------------------
+# IQ-001 regression: list_all_gt must filter to ground-truth-item docType
+# ---------------------------------------------------------------------------
+
+
+def test_list_all_gt_query_includes_doctype_filter(repo: CosmosGroundTruthRepo) -> None:
+    """list_all_gt must generate a query that excludes non-ground-truth documents."""
+    # This test does not call list_all_gt; it rebuilds the expected WHERE clause by hand.
+    # Expected shape: WHERE c.docType = 'ground-truth-item' [AND c.status = @status]
+    # The IQ-003 tests below assert the query that list_all_gt() actually emits.
+    clauses = ["c.docType = 'ground-truth-item'"]
+    where = " WHERE " + " AND ".join(clauses)
+    query = f"SELECT * FROM c{where}"
+    assert "c.docType = 'ground-truth-item'" in query
+    assert "SELECT * FROM c WHERE" in query
+
+
+def test_list_all_gt_query_with_status_filter(repo: CosmosGroundTruthRepo) -> None:
+    """list_all_gt with status must include BOTH docType and status filters."""
+
+    clauses = ["c.docType = 'ground-truth-item'", "c.status = @status"]
+    where = " WHERE " + " AND ".join(clauses)
+    query = f"SELECT * FROM c{where}"
+    assert "c.docType = 'ground-truth-item'" in query
+    assert "c.status = @status" in query
+    # Ensure both clauses are present (not SELECT * FROM c WHERE c.status = @status alone)
+    assert "c.docType = 'ground-truth-item' AND c.status = @status" in query
+
+
+# ---------------------------------------------------------------------------
+# IQ-003 strengthened: call list_all_gt() directly and assert query emitted
+# ---------------------------------------------------------------------------
+
+
+async def _empty_aiter():  # type: ignore[return]
+    """Empty async generator used to stub out Cosmos query_items in unit tests."""
+    return
+    yield  # pragma: no cover – presence makes this an async generator function
+
+
+@pytest.mark.asyncio
+async def test_list_all_gt_directly_emits_doctype_filter(
+    repo: CosmosGroundTruthRepo,
+) -> None:
+    """Calling list_all_gt() directly must pass the docType filter to query_items."""
+    captured: list[str] = []
+
+    def _mock_query_items(*args: object, **kwargs: object) -> object:
+        query = kwargs.get("query") or (args[0] if args else "")
+        captured.append(str(query))
+        return _empty_aiter()
+
+    mock_container = MagicMock()
+    mock_container.query_items = _mock_query_items
+
+    with patch.object(repo, "_ensure_initialized", new_callable=AsyncMock):
+        repo._gt_container = mock_container  # type: ignore[assignment]
+        result = await repo.list_all_gt()
+
+    assert result == []
+    assert len(captured) == 1
+    assert "c.docType = 'ground-truth-item'" in captured[0]
+    assert "SELECT * FROM c WHERE" in captured[0]
+
+
+@pytest.mark.asyncio
+async def test_list_all_gt_directly_emits_doctype_and_status_filter(
+    repo: CosmosGroundTruthRepo,
+) -> None:
+    """list_all_gt(status=draft) must emit BOTH docType and status clauses."""
+    captured: list[str] = []
+
+    def _mock_query_items(*args: object, **kwargs: object) -> object:
+        query = kwargs.get("query") or (args[0] if args else "")
+        captured.append(str(query))
+        return _empty_aiter()
+
+    mock_container = MagicMock()
+    mock_container.query_items = _mock_query_items
+
+    with patch.object(repo, "_ensure_initialized", new_callable=AsyncMock):
+        repo._gt_container = mock_container  # type: ignore[assignment]
+        result = await repo.list_all_gt(status=GroundTruthStatus.draft)
+
+    assert result == []
+    assert len(captured) == 1
+    assert "c.docType = 'ground-truth-item'" in captured[0]
+    assert "c.status = @status" in captured[0]
+    assert "c.docType = 'ground-truth-item' AND c.status = @status" in captured[0]
diff --git a/backend/tests/unit/test_demo_mode_memory_api.py b/backend/tests/unit/test_demo_mode_memory_api.py
new file mode 100644
index 0000000..abc0fa9
--- /dev/null
+++ b/backend/tests/unit/test_demo_mode_memory_api.py
@@ -0,0 +1,80 @@
+from __future__ import annotations
+
+from typing import Any
+
+import pytest
+from httpx import ASGITransport, AsyncClient
+
+from app.container import container
+from app.core.config import settings
+
+
+@pytest.mark.anyio
+async def test_demo_mode_seeds_memory_backend_for_api_usage() -> None:
+    from app.main import create_app
+
+    lifespan = pytest.importorskip("asgi_lifespan")
+    LifespanManager = lifespan.LifespanManager
+
+    original_settings = {
+        "REPO_BACKEND": settings.REPO_BACKEND,
+        "DEMO_MODE": settings.DEMO_MODE,
+        "DEMO_USER_ID": settings.DEMO_USER_ID,
+    }
+    original_container: dict[str, Any] = {
+        "repo": getattr(container, "repo", None),
+        "assignment_service": getattr(container, "assignment_service", None),
+        "search_service": getattr(container, "search_service", None),
+        "snapshot_service": getattr(container, "snapshot_service", None),
+        "curation_service": getattr(container, "curation_service", None),
+        "tag_registry_service": getattr(container, "tag_registry_service", None),
+        "tags_repo": getattr(container, "tags_repo", None),
+        "tag_definitions_repo": getattr(container, "tag_definitions_repo", None),
+    }
+
+    settings.REPO_BACKEND = "memory"
+    settings.DEMO_MODE = True
+    settings.DEMO_USER_ID = "anonymous"
+
+    container.repo = None
+
+    app = create_app()
+
+    try:
+        async with LifespanManager(app):
+            async with AsyncClient(
+                transport=ASGITransport(app=app),
+                base_url="http://testserver",
+            ) as client:
+                assignments = await client.get("/v1/assignments/my")
+                assert assignments.status_code == 200
+                assignment_items = assignments.json()
+                assert len(assignment_items) == 2
+                assert {item["id"] for item in assignment_items} == {
+                    "demo-data-overage",
+                    "demo-hotspot-weekend",
+                }
+
+                search = await client.get("/v1/search", params={"q": "data", "top": 5})
+                assert search.status_code == 200
+                assert search.json()["results"]
+
+                stats = await client.get("/v1/ground-truths/stats")
+                assert stats.status_code == 200
+                assert stats.json() == {"draft": 2, "approved": 1, "deleted": 1}
+
+                datasets = await client.get("/v1/datasets")
+                assert datasets.status_code == 200
+                assert set(datasets.json()) == {"customer-feedback", "network-diagnostics"}
+
+                instructions = await client.get(
+                    "/v1/datasets/customer-feedback/curation-instructions"
+                )
+                assert instructions.status_code == 200
+                assert "Customer Feedback Demo Instructions" in instructions.json()["instructions"]
+    finally:
+        settings.REPO_BACKEND = original_settings["REPO_BACKEND"]
+        settings.DEMO_MODE = original_settings["DEMO_MODE"]
+        settings.DEMO_USER_ID = original_settings["DEMO_USER_ID"]
+        for attr, value in original_container.items():
+            setattr(container, attr, value)
diff --git a/backend/tests/unit/test_duplicate_detection.py b/backend/tests/unit/test_duplicate_detection.py
index f46a275..696839f 100644
--- a/backend/tests/unit/test_duplicate_detection.py
+++ b/backend/tests/unit/test_duplicate_detection.py
@@ -8,8 +8,8 @@
     detect_duplicates_for_item,
     detect_duplicates_for_bulk_items,
 )
-from app.domain.models import GroundTruthItem
 from app.domain.enums import GroundTruthStatus
+from tests.test_helpers import make_test_entry
 
 
 def test_normalize_text_basic():
@@ -28,9 +28,9 @@ def test_normalize_text_none_and_empty():
 
 def test_get_question_text_edited_preferred():
     """Test that edited question is preferred over synth question."""
-    item = GroundTruthItem(
+    item = make_test_entry(
         id="test-1",
-        datasetName="test",
+        dataset_name="test",
         synth_question="Original question",
         edited_question="Edited question",
         status=GroundTruthStatus.draft,
@@ -40,9 +40,9 @@ def test_get_question_text_edited_preferred():
 
 def test_get_question_text_synth_fallback():
     """Test fallback to synth question when edited is missing."""
-    item = GroundTruthItem(
+    item = make_test_entry(
         id="test-1",
-        datasetName="test",
+        dataset_name="test",
         synth_question="Original question",
         status=GroundTruthStatus.draft,
     )
@@ -51,15 +51,15 @@ def test_get_question_text_synth_fallback():
 
 def test_items_are_duplicates_exact_question_match():
     """Test detection of exact question match."""
-    draft = GroundTruthItem(
+    draft = make_test_entry(
         id="draft-1",
-        datasetName="test",
+        dataset_name="test",
         synth_question="What is the capital of France?",
         status=GroundTruthStatus.draft,
     )
-    approved = GroundTruthItem(
+    approved = make_test_entry(
         id="approved-1",
-        datasetName="test",
+        dataset_name="test",
         synth_question="WHAT IS THE CAPITAL OF FRANCE?",  # Different case
         status=GroundTruthStatus.approved,
     )
@@ -71,16 +71,16 @@ def test_items_are_duplicates_exact_question_match():
 
 def test_items_are_duplicates_question_and_answer_match():
     """Test detection with both question and answer match."""
-    draft = GroundTruthItem(
+    draft = make_test_entry(
         id="draft-1",
-        datasetName="test",
+        dataset_name="test",
         synth_question="What is 2+2?",
         answer="The answer is 4",
         status=GroundTruthStatus.draft,
     )
-    approved = GroundTruthItem(
+    approved = make_test_entry(
         id="approved-1",
-        datasetName="test",
+        dataset_name="test",
         synth_question="what is 2+2?",
         answer="THE ANSWER IS 4",
         status=GroundTruthStatus.approved,
@@ -93,15 +93,15 @@ def test_items_are_duplicates_question_and_answer_match():
 
 def test_items_are_duplicates_whitespace_normalized():
     """Test that whitespace differences don't prevent match."""
-    draft = GroundTruthItem(
+    draft = make_test_entry(
         id="draft-1",
-        datasetName="test",
+        dataset_name="test",
         synth_question="What   is\n\nthe    answer?",
         status=GroundTruthStatus.draft,
     )
-    approved = GroundTruthItem(
+    approved = make_test_entry(
         id="approved-1",
-        datasetName="test",
+        dataset_name="test",
         synth_question="What is the answer?",
         status=GroundTruthStatus.approved,
     )
@@ -113,15 +113,15 @@ def test_items_are_duplicates_whitespace_normalized():
 
 def test_items_are_not_duplicates_different_questions():
     """Test that different questions are not flagged."""
-    draft = GroundTruthItem(
+    draft = make_test_entry(
         id="draft-1",
-        datasetName="test",
+        dataset_name="test",
         synth_question="What is the capital of France?",
         status=GroundTruthStatus.draft,
     )
-    approved = GroundTruthItem(
+    approved = make_test_entry(
         id="approved-1",
-        datasetName="test",
+        dataset_name="test",
         synth_question="What is the capital of Germany?",
         status=GroundTruthStatus.approved,
     )
@@ -133,15 +133,15 @@ def test_items_are_not_duplicates_different_questions():
 
 def test_items_are_not_duplicates_missing_question():
     """Test that items without questions are not flagged."""
-    draft = GroundTruthItem(
+    draft = make_test_entry(
         id="draft-1",
-        datasetName="test",
+        dataset_name="test",
         synth_question="",
         status=GroundTruthStatus.draft,
     )
-    approved = GroundTruthItem(
+    approved = make_test_entry(
         id="approved-1",
-        datasetName="test",
+        dataset_name="test",
         synth_question="What is the answer?",
         status=GroundTruthStatus.approved,
     )
@@ -152,22 +152,22 @@ def test_items_are_not_duplicates_missing_question():
 
 def test_detect_duplicates_for_item_finds_match():
     """Test detecting duplicates for a single item."""
-    draft = GroundTruthItem(
+    draft = make_test_entry(
         id="draft-1",
-        datasetName="test",
+        dataset_name="test",
         synth_question="What is Python?",
         status=GroundTruthStatus.draft,
     )
     approved_items = [
-        GroundTruthItem(
+        make_test_entry(
             id="approved-1",
-            datasetName="test",
+            dataset_name="test",
             synth_question="What is Java?",
             status=GroundTruthStatus.approved,
         ),
-        GroundTruthItem(
+        make_test_entry(
             id="approved-2",
-            datasetName="test",
+            dataset_name="test",
             synth_question="What is Python?",
             status=GroundTruthStatus.approved,
         ),
@@ -182,16 +182,16 @@ def test_detect_duplicates_for_item_finds_match():
 
 def test_detect_duplicates_for_item_respects_max_results():
     """Test that max_results limit is enforced."""
-    draft = GroundTruthItem(
+    draft = make_test_entry(
         id="draft-1",
-        datasetName="test",
+        dataset_name="test",
         synth_question="Common question",
         status=GroundTruthStatus.draft,
     )
     approved_items = [
-        GroundTruthItem(
+        make_test_entry(
             id=f"approved-{i}",
-            datasetName="test",
+            dataset_name="test",
             synth_question="Common question",
             status=GroundTruthStatus.approved,
         )
@@ -204,22 +204,22 @@ def test_detect_duplicates_for_item_respects_max_results():
 
 def test_detect_duplicates_for_item_ignores_non_approved():
     """Test that non-approved items are ignored."""
-    draft = GroundTruthItem(
+    draft = make_test_entry(
         id="draft-1",
-        datasetName="test",
+        dataset_name="test",
         synth_question="What is the answer?",
         status=GroundTruthStatus.draft,
     )
     other_items = [
-        GroundTruthItem(
+        make_test_entry(
             id="draft-2",
-            datasetName="test",
+            dataset_name="test",
             synth_question="What is the answer?",
             status=GroundTruthStatus.draft,  # Not approved
         ),
-        GroundTruthItem(
+        make_test_entry(
             id="deleted-1",
-            datasetName="test",
+            dataset_name="test",
             synth_question="What is the answer?",
             status=GroundTruthStatus.deleted,  # Not approved
         ),
@@ -231,9 +231,9 @@ def test_detect_duplicates_for_item_ignores_non_approved():
 
 def test_detect_duplicates_for_item_ignores_self():
     """Test that an item is not flagged as duplicate of itself."""
-    item = GroundTruthItem(
+    item = make_test_entry(
         id="same-id",
-        datasetName="test",
+        dataset_name="test",
         synth_question="What is the answer?",
         status=GroundTruthStatus.approved,
     )
@@ -246,29 +246,29 @@ def test_detect_duplicates_for_item_ignores_self():
 def test_detect_duplicates_for_bulk_items():
     """Test bulk duplicate detection."""
     draft_items = [
-        GroundTruthItem(
+        make_test_entry(
             id="draft-1",
-            datasetName="test",
+            dataset_name="test",
             synth_question="What is Python?",
             status=GroundTruthStatus.draft,
         ),
-        GroundTruthItem(
+        make_test_entry(
             id="draft-2",
-            datasetName="test",
+            dataset_name="test",
             synth_question="What is Java?",
             status=GroundTruthStatus.draft,
         ),
     ]
     approved_items = [
-        GroundTruthItem(
+        make_test_entry(
             id="approved-1",
-            datasetName="test",
+            dataset_name="test",
             synth_question="What is Python?",
             status=GroundTruthStatus.approved,
         ),
-        GroundTruthItem(
+        make_test_entry(
             id="approved-2",
-            datasetName="test",
+            dataset_name="test",
             synth_question="What is C++?",
             status=GroundTruthStatus.approved,
         ),
@@ -283,23 +283,23 @@ def test_detect_duplicates_for_bulk_items():
 def test_detect_duplicates_for_bulk_items_only_checks_drafts():
     """Test that only draft items are checked for duplicates."""
     items = [
-        GroundTruthItem(
+        make_test_entry(
             id="approved-new",
-            datasetName="test",
+            dataset_name="test",
             synth_question="What is Python?",
             status=GroundTruthStatus.approved,  # Not a draft
         ),
-        GroundTruthItem(
+        make_test_entry(
             id="draft-1",
-            datasetName="test",
+            dataset_name="test",
             synth_question="What is Java?",
             status=GroundTruthStatus.draft,
         ),
     ]
     approved_items = [
-        GroundTruthItem(
+        make_test_entry(
             id="approved-1",
-            datasetName="test",
+            dataset_name="test",
             synth_question="What is Python?",
             status=GroundTruthStatus.approved,
         ),
@@ -330,16 +330,16 @@ def test_duplicate_warning_model():
 
 def test_detect_duplicates_uses_edited_question():
     """Test that edited question is used when present."""
-    draft = GroundTruthItem(
+    draft = make_test_entry(
         id="draft-1",
-        datasetName="test",
+        dataset_name="test",
         synth_question="Original question",
         edited_question="What is the edited question?",
         status=GroundTruthStatus.draft,
     )
-    approved = GroundTruthItem(
+    approved = make_test_entry(
         id="approved-1",
-        datasetName="test",
+        dataset_name="test",
         synth_question="Different original",
         edited_question="What is the edited question?",
         status=GroundTruthStatus.approved,
@@ -348,3 +348,51 @@ def test_detect_duplicates_uses_edited_question():
     warnings = detect_duplicates_for_item(draft, [approved])
     assert len(warnings) == 1
     assert warnings[0].duplicate_id == "approved-1"
+
+
+def test_detect_duplicates_for_generic_history_match():
+    draft = make_test_entry(
+        id="draft-1",
+        dataset_name="test",
+        status=GroundTruthStatus.draft,
+        history=[
+            {"role": "user", "msg": "Summarize the incident"},
+            {"role": "assistant", "msg": "The service restarted automatically."},
+        ],
+    )
+    approved = make_test_entry(
+        id="approved-1",
+        dataset_name="test",
+        status=GroundTruthStatus.approved,
+        history=[
+            {"role": "user", "msg": "Summarize the incident"},
+            {"role": "assistant", "msg": "The service restarted automatically."},
+        ],
+    )
+
+    warnings = detect_duplicates_for_item(draft, [approved])
+    assert len(warnings) == 1
+    assert warnings[0].match_reason == "exact question and answer match"
+
+
+def test_detect_duplicates_for_generic_structured_fields_match():
+    draft = make_test_entry(
+        id="draft-1",
+        dataset_name="test",
+        status=GroundTruthStatus.draft,
+        contextEntries=[{"key": "customerEmail", "value": "alice@example.com"}],
+        toolCalls=[{"name": "lookup_customer", "response": {"ticket": "INC-42"}}],
+        tracePayload={"ticketSummary": "Needs escalation"},
+    )
+    approved = make_test_entry(
+        id="approved-1",
+        dataset_name="test",
+        status=GroundTruthStatus.approved,
+        contextEntries=[{"key": "customerEmail", "value": "alice@example.com"}],
+        toolCalls=[{"name": "lookup_customer", "response": {"ticket": "INC-42"}}],
+        tracePayload={"ticketSummary": "Needs escalation"},
+    )
+
+    warnings = detect_duplicates_for_item(draft, [approved])
+    assert len(warnings) == 1
+    assert warnings[0].match_reason == "exact generic fields match"
diff --git a/backend/tests/unit/test_export_registry.py b/backend/tests/unit/test_export_registry.py
index d7d9267..a1454b8 100644
--- a/backend/tests/unit/test_export_registry.py
+++ b/backend/tests/unit/test_export_registry.py
@@ -94,3 +94,20 @@ def test_resolve_chain_uses_default_order_when_missing() -> None:
     registry.register(OtherProcessor())
     resolved = registry.resolve_chain(None, ["merge_tags", "other"])
     assert [p.name for p in resolved] == ["merge_tags", "other"]
+
+
+def test_apply_transforms_runs_in_order() -> None:
+    registry = ExportProcessorRegistry()
+    docs = [{"id": "1", "tags": ["a"]}]
+    transforms = [
+        type("T1", (), {"transform": staticmethod(lambda doc: {**doc, "stage": 1})})(),
+        type(
+            "T2",
+            (),
+            {"transform": staticmethod(lambda doc: {**doc, "stage": doc["stage"] + 1})},
+        )(),
+    ]
+
+    transformed = registry.apply_transforms(docs, transforms)
+
+    assert transformed == [{"id": "1", "tags": ["a"], "stage": 2}]
diff --git a/backend/tests/unit/test_groundtruthitem_tags_validation.py b/backend/tests/unit/test_groundtruthitem_tags_validation.py
index f8932c8..a80b3ec 100644
--- a/backend/tests/unit/test_groundtruthitem_tags_validation.py
+++ b/backend/tests/unit/test_groundtruthitem_tags_validation.py
@@ -1,6 +1,6 @@
 import pytest
 
-from app.domain.models import GroundTruthItem
+from app.domain.models import AgenticGroundTruthEntry
 
 
 BASE = dict(id="id1", datasetName="ds", synthQuestion="What is this product?")
@@ -9,7 +9,7 @@
 def make_item(**overrides):
     data = {**BASE, **overrides}
     # Allow both field names and aliases; Pydantic config handles this
-    return GroundTruthItem(**data)
+    return AgenticGroundTruthEntry(**data)
 
 
 def test_model_accepts_valid_tag_set():
diff --git a/backend/tests/unit/test_inference_service.py b/backend/tests/unit/test_inference_service.py
deleted file mode 100644
index c79638e..0000000
--- a/backend/tests/unit/test_inference_service.py
+++ /dev/null
@@ -1,154 +0,0 @@
-"""Unit tests for the GTC inference adapter.
-
-The test-client implementation in app/adapters/inference/inference.py is treated
-as read-only/opaque. These tests focus on the supported shim layer exposed by
-app/adapters/gtc_inference_adapter.py.
-"""
-
-from __future__ import annotations
-
-from unittest.mock import MagicMock, patch
-
-import pytest
-
-from app.adapters.gtc_inference_adapter import (
-    GTCInferenceAdapter,
-    MAX_RESULTS,
-    MAX_STRING_LENGTH,
-)
-
-
-class TestExtractReferences:
-    def test_maps_fields_and_snippet(self):
-        adapter = _make_adapter()
-        calls = [
-            {
-                "results": [
-                    {
-                        "chunk_id": "doc-1",
-                        "title": "Title 1",
-                        "url": "https://a.com",
-                        "content": "Content 1",
-                    }
-                ]
-            }
-        ]
-
-        refs = adapter._extract_references(calls)
-        assert refs == [
-            {
-                "id": "doc-1",
-                "title": "Title 1",
-                "url": "https://a.com",
-                "snippet": "Content 1",
-            }
-        ]
-
-    def test_falls_back_to_id_when_chunk_id_missing(self):
-        adapter = _make_adapter()
-        calls = [
-            {"results": [{"id": "fallback", "title": "T", "url": "https://x", "content": "C"}]}
-        ]
-        refs = adapter._extract_references(calls)
-        assert refs[0]["id"] == "fallback"
-
-    def test_skips_calls_with_error(self):
-        adapter = _make_adapter()
-        calls = [
-            {
-                "error": "boom",
-                "results": [{"chunk_id": "1", "title": "T", "url": "u", "content": "c"}],
-            },
-            {"results": [{"chunk_id": "2", "title": "T2", "url": "u2", "content": "c2"}]},
-        ]
-        refs = adapter._extract_references(calls)
-        assert len(refs) == 1
-        assert refs[0]["id"] == "2"
-
-    def test_truncates_snippet(self):
-        adapter = _make_adapter()
-        calls = [
-            {
-                "results": [
-                    {
-                        "chunk_id": "1",
-                        "title": "T",
-                        "url": "u",
-                        "content": "x" * (MAX_STRING_LENGTH + 5),
-                    }
-                ]
-            }
-        ]
-        refs = adapter._extract_references(calls)
-        assert refs[0]["snippet"].endswith("...")
-        assert len(refs[0]["snippet"]) == MAX_STRING_LENGTH + 3
-
-    def test_caps_total_references(self):
-        adapter = _make_adapter()
-        calls = [
-            {
-                "results": [
-                    {
-                        "chunk_id": f"doc-{i}",
-                        "title": f"T{i}",
-                        "url": f"https://{i}",
-                        "content": "C",
-                    }
-                    for i in range(MAX_RESULTS + 10)
-                ]
-            }
-        ]
-        refs = adapter._extract_references(calls)
-        assert len(refs) == MAX_RESULTS
-
-
-class TestGenerate:
-    def test_empty_message_raises(self):
-        adapter = _make_adapter()
-        with pytest.raises(ValueError, match="message cannot be empty"):
-            adapter.generate(user_id="u", message="   ")
-
-    def test_happy_path_returns_content_and_references(self):
-        fake_inference_service = MagicMock()
-        fake_inference_service.process_inference_request.return_value = {
-            "response_text": "hello",
-            "calls": [{"results": [{"chunk_id": "c1", "title": "t", "url": "u", "content": "s"}]}],
-        }
-
-        adapter = _make_adapter(inference_service=fake_inference_service)
-        result = adapter.generate(user_id="u", message="What is X?")
-        assert result["content"] == "hello"
-        assert result["references"][0]["id"] == "c1"
-
-    def test_empty_response_text_raises(self):
-        fake_inference_service = MagicMock()
-        fake_inference_service.process_inference_request.return_value = {
-            "response_text": "",
-            "calls": [],
-        }
-        adapter = _make_adapter(inference_service=fake_inference_service)
-        with pytest.raises(RuntimeError, match="empty response"):
-            adapter.generate(user_id="u", message="What is X?")
-
-    def test_inference_exception_wrapped(self):
-        fake_inference_service = MagicMock()
-        fake_inference_service.process_inference_request.side_effect = Exception("kaboom")
-        adapter = _make_adapter(inference_service=fake_inference_service)
-        with pytest.raises(RuntimeError, match="Agent request failed"):
-            adapter.generate(user_id="u", message="What is X?")
-
-
-def _make_adapter(*, inference_service: object | None = None) -> GTCInferenceAdapter:
-    """Create an adapter without hitting Azure SDK network calls."""
-    with patch("app.adapters.gtc_inference_adapter.DefaultAzureCredential"):
-        with patch("app.adapters.gtc_inference_adapter.InferenceService") as mock_inference_cls:
-            if inference_service is not None:
-                mock_inference_cls.return_value = inference_service
-
-            return GTCInferenceAdapter(
-                project_endpoint="https://project.example.com",
-                agent_id="agent-123",
-                retrieval_url="https://retrieval.example.com/search",
-                permissions_scope="api://retrieval/.default",
-                credential=MagicMock(),
-            )
diff --git a/backend/tests/unit/test_keyword_search.py b/backend/tests/unit/test_keyword_search.py
index 904e020..d636eee 100644
--- a/backend/tests/unit/test_keyword_search.py
+++ b/backend/tests/unit/test_keyword_search.py
@@ -1,7 +1,8 @@
 """Unit tests for keyword search functionality."""
 
-from app.domain.models import GroundTruthItem, HistoryItem
+from app.domain.models import HistoryItem
 from app.adapters.repos.cosmos_repo import CosmosGroundTruthRepo
+from tests.test_helpers import make_test_entry
 
 
 class TestKeywordMatching:
@@ -9,9 +10,9 @@ class TestKeywordMatching:
 
     def test_matches_synth_question(self):
         """Test matching keyword in synth_question field."""
-        item = GroundTruthItem(
+        item = make_test_entry(
             id="test-1",
-            datasetName="test",
+            dataset_name="test",
             bucket="00000000-0000-0000-0000-000000000001",
             synth_question="What is machine learning?",
             answer=None,
@@ -23,9 +24,9 @@ def test_matches_synth_question(self):
 
     def test_matches_edited_question(self):
         """Test matching keyword in edited_question field."""
-        item = GroundTruthItem(
+        item = make_test_entry(
             id="test-2",
-            datasetName="test",
+            dataset_name="test",
             bucket="00000000-0000-0000-0000-000000000001",
             synth_question="Original question",
             edited_question="What is deep learning?",
@@ -38,9 +39,9 @@ def test_matches_edited_question(self):
 
     def test_matches_answer(self):
         """Test matching keyword in answer field."""
-        item = GroundTruthItem(
+        item = make_test_entry(
             id="test-3",
-            datasetName="test",
+            dataset_name="test",
             bucket="00000000-0000-0000-0000-000000000001",
             synth_question="Question",
             answer="Neural networks are a type of machine learning model",
@@ -52,9 +53,9 @@ def test_matches_answer(self):
 
     def test_matches_history_messages(self):
         """Test matching keyword in history turn messages."""
-        item = GroundTruthItem(
+        item = make_test_entry(
             id="test-4",
-            datasetName="test",
+            dataset_name="test",
             bucket="00000000-0000-0000-0000-000000000001",
             synth_question="Question",
             answer="Answer",
@@ -70,9 +71,9 @@ def test_matches_history_messages(self):
 
     def test_empty_keyword_matches_all(self):
         """Test that empty keyword matches all items."""
-        item = GroundTruthItem(
+        item = make_test_entry(
             id="test-5",
-            datasetName="test",
+            dataset_name="test",
             bucket="00000000-0000-0000-0000-000000000001",
             synth_question="Question",
             answer="Answer",
@@ -83,9 +84,9 @@ def test_empty_keyword_matches_all(self):
 
     def test_no_match_returns_false(self):
         """Test that non-matching keyword returns False."""
-        item = GroundTruthItem(
+        item = make_test_entry(
             id="test-6",
-            datasetName="test",
+            dataset_name="test",
             bucket="00000000-0000-0000-0000-000000000001",
             synth_question="Question about cats",
             answer="Cats are animals",
@@ -96,9 +97,9 @@ def test_no_match_returns_false(self):
 
     def test_partial_match(self):
         """Test that partial word matching works (substring)."""
-        item = GroundTruthItem(
+        item = make_test_entry(
             id="test-7",
-            datasetName="test",
+            dataset_name="test",
             bucket="00000000-0000-0000-0000-000000000001",
             synth_question="Question about networking",
             answer="Answer",
@@ -109,9 +110,9 @@ def test_partial_match(self):
 
     def test_handles_none_fields(self):
         """Test that None fields don't cause errors."""
-        item = GroundTruthItem(
+        item = make_test_entry(
             id="test-8",
-            datasetName="test",
+            dataset_name="test",
             bucket="00000000-0000-0000-0000-000000000001",
             synth_question="Required field",
             edited_question=None,
diff --git a/backend/tests/unit/test_openapi.py b/backend/tests/unit/test_openapi.py
index a297590..7e47a63 100644
--- a/backend/tests/unit/test_openapi.py
+++ b/backend/tests/unit/test_openapi.py
@@ -45,3 +45,58 @@ async def test_get_specific_schema(async_client: AsyncClient):
     # Unknown should 404
     r2 = await async_client.get("/v1/schemas/DOES_NOT_EXIST")
     assert r2.status_code == 404
+
+
+@pytest.mark.anyio
+async def test_ground_truth_openapi_uses_agentic_schema(async_client: AsyncClient):
+    r = await async_client.get("/v1/openapi.json")
+    assert r.status_code == 200
+
+    data = r.json()
+    import_request = data["paths"]["/v1/ground-truths"]["post"]["requestBody"]["content"][
+        "application/json"
+    ]["schema"]["items"]["$ref"]
+    update_response = data["paths"]["/v1/ground-truths/{datasetName}/{bucket}/{item_id}"]["put"][
+        "responses"
+    ]["200"]["content"]["application/json"]["schema"]["$ref"]
+
+    assert "AgenticGroundTruthEntry" in import_request
+    assert "AgenticGroundTruthEntry" in update_response
+    assert "GroundTruthItem" not in import_request
+
+
+@pytest.mark.anyio
+async def test_update_requests_do_not_advertise_nullable_expected_tools(async_client: AsyncClient):
+    r = await async_client.get("/v1/openapi.json")
+    assert r.status_code == 200
+
+    data = r.json()
+    schemas = data["components"]["schemas"]
+
+    assignment_expected_tools = schemas["AssignmentUpdateRequest"]["properties"]["expectedTools"]
+    ground_truth_expected_tools = schemas["GroundTruthUpdateRequest"]["properties"]["expectedTools"]
+
+    assert assignment_expected_tools["$ref"] == "#/components/schemas/ExpectedTools"
+    assert ground_truth_expected_tools["$ref"] == "#/components/schemas/ExpectedTools"
+    assert "anyOf" not in assignment_expected_tools
+    assert "anyOf" not in ground_truth_expected_tools
+
+
+@pytest.mark.anyio
+async def test_update_requests_share_stable_history_patch_schema(async_client: AsyncClient):
+    r = await async_client.get("/v1/openapi.json")
+    assert r.status_code == 200
+
+    data = r.json()
+    schemas = data["components"]["schemas"]
+
+    assignment_history = schemas["AssignmentUpdateRequest"]["properties"]["history"]["anyOf"][0][
+        "items"
+    ]["$ref"]
+    ground_truth_history = schemas["GroundTruthUpdateRequest"]["properties"]["history"]["anyOf"][0][
+        "items"
+    ]["$ref"]
+
+    assert assignment_history == "#/components/schemas/HistoryEntryPatch"
+    assert ground_truth_history == "#/components/schemas/HistoryEntryPatch"
+    assert "HistoryEntryPatch" in schemas
diff --git a/backend/tests/unit/test_phase1_rework.py b/backend/tests/unit/test_phase1_rework.py
new file mode 100644
index 0000000..9810eef
--- /dev/null
+++ b/backend/tests/unit/test_phase1_rework.py
@@ -0,0 +1,1048 @@
+"""Phase 1 review rework regression tests.
+
+Tests cover:
+- IV-001: approve=true bulk import enforces generic approval validation
+- IV-002: Assignment route history edits reset totalReferences
+- IV-003: Invalid status values rejected with HTTP 400 on both routes
+- RR-001: Explicit status: null rejected with HTTP 400 on both update routes
+- RR-002: Bulk import failed count reports unique failed items, not raw error count
+- RR-003: Bulk import approval errors carry original request indices
+"""
+
+from __future__ import annotations
+
+from unittest.mock import AsyncMock
+from uuid import uuid4
+
+import pytest
+
+from fastapi import HTTPException
+
+from app.domain.models import (
+    AgenticGroundTruthEntry,
+    BulkImportPersistenceError,
+    BulkImportResult,
+    HistoryEntry,
+)
+from app.domain.enums import GroundTruthStatus
+
+
+class TestBulkImportApprovalValidation:
+    """Test IV-001: approve=true bulk import enforces generic approval validation."""
+
+    @pytest.mark.asyncio
+    async def test_bulk_import_approve_rejects_items_without_history(self):
+        """Bulk import with approve=true should reject items that lack conversation history."""
+        from app.core.auth import UserContext
+        from app.container import container
+
+        # Prepare invalid item: no history, no question/answer
+        invalid_item = AgenticGroundTruthEntry(
+            id=str(uuid4()),
+            datasetName="test-dataset",
+            bucket=str(uuid4()),
+            status=GroundTruthStatus.draft,
+            docType="ground-truth",
+            schemaVersion="agentic-v1",
+        )
+
+        payload = [invalid_item]
+
+        # Mock dependencies
+        original_repo = container.repo
+        container.repo = AsyncMock()
+        container.repo.import_bulk_gt = AsyncMock(
+            return_value=BulkImportResult(imported=0, errors=[])
+        )
+        container.repo.list_gt_paginated = AsyncMock(return_value=([], None))
+
+        try:
+            mock_user = UserContext(user_id="test-user")
+
+            # Import the import_bulk route handler directly
+            from app.api.v1.ground_truths import import_bulk
+
+            result = await import_bulk(
+                items=payload,
+                user=mock_user,
+                buckets=1,
+                approve=True,
+            )
+
+            # Should have errors for items that don't meet approval criteria
+            errors = result.errors
+            assert len(errors) > 0
+
+            # Check that an approval validation error was reported
+            approval_errors = [e for e in errors if e.code == "APPROVAL_VALIDATION_FAILED"]
+            assert len(approval_errors) > 0
+
+            # Verify history requirement is in the error message
+            error_messages = [e.message for e in approval_errors]
+            assert any("history" in msg.lower() for msg in error_messages)
+
+            # No items should have been imported
+            assert result.imported == 0
+
+        finally:
+            container.repo = original_repo
+
+    @pytest.mark.asyncio
+    async def test_bulk_import_approve_accepts_valid_items(self):
+        """Bulk import with approve=true should accept items that meet approval criteria."""
+        from app.core.auth import UserContext
+        from app.container import container
+
+        # Prepare valid item with history
+        valid_item = AgenticGroundTruthEntry(
+            id=str(uuid4()),
+            datasetName="test-dataset",
+            bucket=str(uuid4()),
+            status=GroundTruthStatus.draft,
+            docType="ground-truth",
+            schemaVersion="agentic-v1",
+            history=[
+                HistoryEntry(role="user", msg="What is the capital of France?"),
+                HistoryEntry(role="assistant", msg="The capital of France is Paris."),
+            ],
+        )
+
+        payload = [valid_item]
+
+        # Mock dependencies
+        original_repo = container.repo
+        container.repo = AsyncMock()
+        container.repo.import_bulk_gt = AsyncMock(
+            return_value=BulkImportResult(imported=1, errors=[])
+        )
+        container.repo.list_gt_paginated = AsyncMock(return_value=([], None))
+
+        try:
+            mock_user = UserContext(user_id="test-user")
+
+            from app.api.v1.ground_truths import import_bulk
+
+            result = await import_bulk(
+                items=payload,
+                user=mock_user,
+                buckets=1,
+                approve=True,
+            )
+
+            # Should succeed with no errors
+            assert result.imported == 1
+            assert len(result.errors) == 0
+
+        finally:
+            container.repo = original_repo
+
+    @pytest.mark.asyncio
+    async def test_bulk_import_approve_enforces_plugin_pack_approval_hooks(self):
+        """Plugin-pack approval errors must block bulk approve=true entries (R-001).
+
+        Regression test: the bulk import approval path must run
+        validate_item_for_approval() (which includes
+        plugin_pack_registry.collect_approval_errors) rather than the
+        generic-only collect_approval_validation_errors().
+        """
+        from app.core.auth import UserContext
+        from app.container import container
+
+        # A structurally valid item — generic core would approve it.
+        valid_item = AgenticGroundTruthEntry(
+            id=str(uuid4()),
+            datasetName="test-dataset",
+            bucket=str(uuid4()),
+            status=GroundTruthStatus.draft,
+            docType="ground-truth",
+            schemaVersion="agentic-v1",
+            history=[
+                HistoryEntry(role="user", msg="Which city is the capital of France?"),
+                HistoryEntry(role="assistant", msg="Paris."),
+            ],
+        )
+
+        original_repo = container.repo
+        original_registry = container.plugin_pack_registry
+        container.repo = AsyncMock()
+        container.repo.import_bulk_gt = AsyncMock(
+            return_value=BulkImportResult(imported=0, errors=[])
+        )
+        container.repo.list_gt_paginated = AsyncMock(return_value=([], None))
+
+        # Inject a mock plugin-pack registry that returns a pack-level error.
+        mock_registry = AsyncMock()
+        mock_registry.collect_approval_errors = lambda _item: [
+            "plugin-pack: retrieval reference is incomplete"
+        ]
+        mock_registry.filter_core_errors = lambda _item, errors: errors
+        container.plugin_pack_registry = mock_registry
+
+        try:
+            mock_user = UserContext(user_id="test-user")
+            from app.api.v1.ground_truths import import_bulk
+
+            result = await import_bulk(
+                items=[valid_item],
+                user=mock_user,
+                buckets=1,
+                approve=True,
+            )
+
+            # The plugin-pack error must surface as an APPROVAL_VALIDATION_FAILED entry.
+            approval_errors = [e for e in result.errors if e.code == "APPROVAL_VALIDATION_FAILED"]
+            assert len(approval_errors) >= 1, (
+                "Expected at least one APPROVAL_VALIDATION_FAILED error from plugin-pack hook"
+            )
+            assert any("plugin-pack" in e.message for e in approval_errors), (
+                "Error message should contain plugin-pack content"
+            )
+
+            # Original request index must be preserved (0 for a single-item request).
+            assert all(e.index == 0 for e in approval_errors)
+
+            # No items should have been imported.
+            assert result.imported == 0
+
+        finally:
+            container.repo = original_repo
+            container.plugin_pack_registry = original_registry
+
+
+class TestAssignmentHistoryReset:
+    """Test IV-002: Assignment route history edits reset totalReferences."""
+
+    @pytest.mark.asyncio
+    async def test_assignment_update_history_resets_total_references(self):
+        """When history is updated via assignment route, totalReferences should be reset to 0."""
+        from app.core.auth import UserContext
+        from app.container import container
+        from app.api.v1.assignments import update_item
+
+        dataset = "test-dataset"
+        bucket = str(uuid4())
+        item_id = str(uuid4())
+
+        # Create existing item with stale totalReferences
+        existing_item = AgenticGroundTruthEntry(
+            id=item_id,
+            datasetName=dataset,
+            bucket=bucket,
+            status=GroundTruthStatus.draft,
+            docType="ground-truth",
+            schemaVersion="agentic-v1",
+            history=[
+                HistoryEntry(role="user", msg="Old question"),
+                HistoryEntry(role="assistant", msg="Old answer"),
+            ],
+            totalReferences=5,  # Stale value
+            _etag="test-etag",
+        )
+
+        # Mock dependencies
+        original_repo = container.repo
+        container.repo = AsyncMock()
+        container.repo.get_gt = AsyncMock(return_value=existing_item)
+
+        # Mock upsert to capture what gets saved
+        saved_item = None
+
+        async def mock_upsert(item):
+            nonlocal saved_item
+            saved_item = item
+            return item
+
+        container.repo.upsert_gt = AsyncMock(side_effect=mock_upsert)
+
+        try:
+            mock_user = UserContext(user_id="test-user")
+
+            # Payload with updated history
+            from app.api.v1.assignments import AssignmentUpdateRequest
+
+            payload = AssignmentUpdateRequest(
+                history=[
+                    {"role": "user", "msg": "New question"},
+                    {"role": "assistant", "msg": "New answer"},
+                ],
+                etag="test-etag",
+            )
+
+            await update_item(
+                dataset=dataset,
+                bucket=bucket,
+                item_id=item_id,
+                payload=payload,
+                user=mock_user,
+                if_match=None,
+            )
+
+            # Verify totalReferences was reset to 0
+            assert saved_item is not None
+            assert saved_item.totalReferences == 0
+
+        finally:
+            container.repo = original_repo
+
+    @pytest.mark.asyncio
+    async def test_assignment_clear_history_resets_total_references(self):
+        """When history is cleared via assignment route, totalReferences should be reset to 0."""
+        from app.core.auth import UserContext
+        from app.container import container
+        from app.api.v1.assignments import update_item
+
+        dataset = "test-dataset"
+        bucket = str(uuid4())
+        item_id = str(uuid4())
+
+        # Create existing item with history and totalReferences
+        existing_item = AgenticGroundTruthEntry(
+            id=item_id,
+            datasetName=dataset,
+            bucket=bucket,
+            status=GroundTruthStatus.draft,
+            docType="ground-truth",
+            schemaVersion="agentic-v1",
+            history=[
+                HistoryEntry(role="user", msg="Question"),
+                HistoryEntry(role="assistant", msg="Answer"),
+            ],
+            totalReferences=3,
+            _etag="test-etag",
+        )
+
+        # Mock dependencies
+        original_repo = container.repo
+        container.repo = AsyncMock()
+        container.repo.get_gt = AsyncMock(return_value=existing_item)
+
+        saved_item = None
+
+        async def mock_upsert(item):
+            nonlocal saved_item
+            saved_item = item
+            return item
+
+        container.repo.upsert_gt = AsyncMock(side_effect=mock_upsert)
+
+        try:
+            mock_user = UserContext(user_id="test-user")
+
+            from app.api.v1.assignments import AssignmentUpdateRequest
+
+            payload = AssignmentUpdateRequest(
+                history=None,  # Clear history
+                etag="test-etag",
+            )
+
+            await update_item(
+                dataset=dataset,
+                bucket=bucket,
+                item_id=item_id,
+                payload=payload,
+                user=mock_user,
+                if_match=None,
+            )
+
+            # Verify totalReferences was reset to 0
+            assert saved_item is not None
+            assert saved_item.totalReferences == 0
+
+        finally:
+            container.repo = original_repo
+
+
+class TestInvalidStatusRejection:
+    """Test IV-003: Invalid status values rejected with HTTP 400 on both routes."""
+
+    @pytest.mark.asyncio
+    async def test_ground_truths_route_rejects_invalid_status(self):
+        """Ground truths update route should reject invalid status with HTTP 400."""
+        from app.core.auth import UserContext
+        from app.container import container
+        from app.api.v1.ground_truths import update_ground_truth
+
+        dataset = "test-dataset"
+        bucket = str(uuid4())
+        item_id = str(uuid4())
+
+        existing_item = AgenticGroundTruthEntry(
+            id=item_id,
+            datasetName=dataset,
+            bucket=bucket,
+            status=GroundTruthStatus.draft,
+            docType="ground-truth",
+            schemaVersion="agentic-v1",
+            _etag="test-etag",
+        )
+
+        # Mock dependencies
+        original_repo = container.repo
+        container.repo = AsyncMock()
+        container.repo.get_gt = AsyncMock(return_value=existing_item)
+
+        try:
+            mock_user = UserContext(user_id="test-user")
+
+            from app.api.v1.ground_truths import GroundTruthUpdateRequest
+
+            payload = GroundTruthUpdateRequest(
+                status="invalid-status",  # Invalid status
+                etag="test-etag",
+            )
+
+            # Should raise HTTPException with 400
+            with pytest.raises(HTTPException) as exc_info:
+                await update_ground_truth(
+                    datasetName=dataset,
+                    bucket=bucket,
+                    item_id=item_id,
+                    payload=payload,
+                    user=mock_user,
+                    if_match=None,
+                )
+
+            assert exc_info.value.status_code == 400
+            assert "invalid status value" in exc_info.value.detail.lower()
+
+        finally:
+            container.repo = original_repo
+
+    @pytest.mark.asyncio
+    async def test_assignments_route_rejects_invalid_status(self):
+        """Assignments update route should reject invalid status with HTTP 400."""
+        from app.core.auth import UserContext
+        from app.container import container
+        from app.api.v1.assignments import update_item
+
+        dataset = "test-dataset"
+        bucket = str(uuid4())
+        item_id = str(uuid4())
+
+        existing_item = AgenticGroundTruthEntry(
+            id=item_id,
+            datasetName=dataset,
+            bucket=bucket,
+            status=GroundTruthStatus.draft,
+            docType="ground-truth",
+            schemaVersion="agentic-v1",
+            assignedTo="test-user",
+            _etag="test-etag",
+        )
+
+        # Mock dependencies
+        original_repo = container.repo
+        container.repo = AsyncMock()
+        container.repo.get_gt = AsyncMock(return_value=existing_item)
+
+        try:
+            mock_user = UserContext(user_id="test-user")
+
+            from app.api.v1.assignments import AssignmentUpdateRequest
+
+            payload = AssignmentUpdateRequest(
+                status="bogus-status",  # Invalid status
+                etag="test-etag",
+            )
+
+            # Should raise HTTPException with 400
+            with pytest.raises(HTTPException) as exc_info:
+                await update_item(
+                    dataset=dataset,
+                    bucket=bucket,
+                    item_id=item_id,
+                    payload=payload,
+                    user=mock_user,
+                    if_match=None,
+                )
+
+            assert exc_info.value.status_code == 400
+            assert "invalid status value" in exc_info.value.detail.lower()
+
+        finally:
+            container.repo = original_repo
+
+
+class TestNullStatusRejection:
+    """RR-001: Explicit status: null must be rejected with HTTP 400 on both update routes."""
+
+    @pytest.mark.asyncio
+    async def test_ground_truth_route_rejects_null_status(self):
+        """Ground truths PUT route must return HTTP 400 for explicit status: null."""
+        from app.core.auth import UserContext
+        from app.container import container
+        from app.api.v1.ground_truths import update_ground_truth, GroundTruthUpdateRequest
+
+        item_id = str(uuid4())
+        existing_item = AgenticGroundTruthEntry(
+            id=item_id,
+            datasetName="ds",
+            bucket=str(uuid4()),
+            status=GroundTruthStatus.draft,
+            docType="ground-truth",
+            schemaVersion="agentic-v1",
+            _etag="e1",
+        )
+        original_repo = container.repo
+        container.repo = AsyncMock()
+        container.repo.get_gt = AsyncMock(return_value=existing_item)
+
+        try:
+            payload = GroundTruthUpdateRequest.model_validate({"status": None, "etag": "e1"})
+            with pytest.raises(HTTPException) as exc_info:
+                await update_ground_truth(
+                    datasetName="ds",
+                    bucket=existing_item.bucket,
+                    item_id=item_id,
+                    payload=payload,
+                    user=UserContext(user_id="u1"),
+                    if_match=None,
+                )
+            assert exc_info.value.status_code == 400
+            assert "null" in exc_info.value.detail.lower()
+        finally:
+            container.repo = original_repo
+
+
+class TestNullExpectedToolsRejection:
+    """RR-002: Explicit expectedTools: null must be rejected with HTTP 400 on both update routes."""
+
+    @pytest.mark.asyncio
+    async def test_ground_truth_route_rejects_null_expected_tools(self):
+        """Ground truths PUT route must return HTTP 400 for explicit expectedTools: null."""
+        from app.core.auth import UserContext
+        from app.container import container
+        from app.api.v1.ground_truths import update_ground_truth, GroundTruthUpdateRequest
+
+        item_id = str(uuid4())
+        existing_item = AgenticGroundTruthEntry(
+            id=item_id,
+            datasetName="ds",
+            bucket=str(uuid4()),
+            status=GroundTruthStatus.draft,
+            docType="ground-truth",
+            schemaVersion="agentic-v1",
+            _etag="e1",
+        )
+        original_repo = container.repo
+        container.repo = AsyncMock()
+        container.repo.get_gt = AsyncMock(return_value=existing_item)
+
+        try:
+            payload = GroundTruthUpdateRequest.model_validate({"expectedTools": None, "etag": "e1"})
+            with pytest.raises(HTTPException) as exc_info:
+                await update_ground_truth(
+                    datasetName="ds",
+                    bucket=existing_item.bucket,
+                    item_id=item_id,
+                    payload=payload,
+                    user=UserContext(user_id="u1"),
+                    if_match=None,
+                )
+            assert exc_info.value.status_code == 400
+            assert "expectedtools" in exc_info.value.detail.lower()
+            assert "null" in exc_info.value.detail.lower()
+        finally:
+            container.repo = original_repo
+
+    @pytest.mark.asyncio
+    async def test_assignments_route_rejects_null_expected_tools(self):
+        """Assignments PUT route must return HTTP 400 for explicit expectedTools: null."""
+        from app.core.auth import UserContext
+        from app.container import container
+        from app.api.v1.assignments import update_item, AssignmentUpdateRequest
+
+        item_id = str(uuid4())
+        existing_item = AgenticGroundTruthEntry(
+            id=item_id,
+            datasetName="ds",
+            bucket=str(uuid4()),
+            status=GroundTruthStatus.draft,
+            docType="ground-truth",
+            schemaVersion="agentic-v1",
+            assignedTo="u1",
+            _etag="e1",
+        )
+        original_repo = container.repo
+        container.repo = AsyncMock()
+        container.repo.get_gt = AsyncMock(return_value=existing_item)
+
+        try:
+            payload = AssignmentUpdateRequest.model_validate({"expectedTools": None, "etag": "e1"})
+            with pytest.raises(HTTPException) as exc_info:
+                await update_item(
+                    dataset="ds",
+                    bucket=existing_item.bucket,
+                    item_id=item_id,
+                    payload=payload,
+                    user=UserContext(user_id="u1"),
+                    if_match=None,
+                )
+            assert exc_info.value.status_code == 400
+            assert "expectedtools" in exc_info.value.detail.lower()
+            assert "null" in exc_info.value.detail.lower()
+        finally:
+            container.repo = original_repo
+
+    @pytest.mark.asyncio
+    async def test_assignments_route_rejects_null_status(self):
+        """Assignments PUT route must return HTTP 400 for explicit status: null."""
+        from app.core.auth import UserContext
+        from app.container import container
+        from app.api.v1.assignments import update_item, AssignmentUpdateRequest
+
+        item_id = str(uuid4())
+        existing_item = AgenticGroundTruthEntry(
+            id=item_id,
+            datasetName="ds",
+            bucket=str(uuid4()),
+            status=GroundTruthStatus.draft,
+            docType="ground-truth",
+            schemaVersion="agentic-v1",
+            assignedTo="u1",
+            _etag="e1",
+        )
+        original_repo = container.repo
+        container.repo = AsyncMock()
+        container.repo.get_gt = AsyncMock(return_value=existing_item)
+
+        try:
+            payload = AssignmentUpdateRequest.model_validate({"status": None, "etag": "e1"})
+            with pytest.raises(HTTPException) as exc_info:
+                await update_item(
+                    dataset="ds",
+                    bucket=existing_item.bucket,
+                    item_id=item_id,
+                    payload=payload,
+                    user=UserContext(user_id="u1"),
+                    if_match=None,
+                )
+            assert exc_info.value.status_code == 400
+            assert "null" in exc_info.value.detail.lower()
+        finally:
+            container.repo = original_repo
+
+    def test_ground_truth_update_request_schema_non_nullable_status(self):
+        """GroundTruthUpdateRequest must not advertise nullable status in OpenAPI schema."""
+        from app.api.v1.ground_truths import GroundTruthUpdateRequest
+
+        schema = GroundTruthUpdateRequest.model_json_schema()
+        status_prop = schema.get("properties", {}).get("status", {})
+        # status must NOT contain anyOf with null type
+        any_of = status_prop.get("anyOf", [])
+        null_entries = [e for e in any_of if e.get("type") == "null"]
+        assert not null_entries, f"status field advertises nullable in schema: {status_prop}"
+
+    def test_assignment_update_request_schema_non_nullable_status(self):
+        """AssignmentUpdateRequest must not advertise nullable status in OpenAPI schema."""
+        from app.api.v1.assignments import AssignmentUpdateRequest
+
+        schema = AssignmentUpdateRequest.model_json_schema()
+        status_prop = schema.get("properties", {}).get("status", {})
+        any_of = status_prop.get("anyOf", [])
+        null_entries = [e for e in any_of if e.get("type") == "null"]
+        assert not null_entries, f"status field advertises nullable in schema: {status_prop}"
+
+
+class TestBulkImportFailedCount:
+    """RR-002/RR-003: Bulk import failed count = unique failed items; indices preserved."""
+
+    @pytest.mark.asyncio
+    async def test_failed_count_is_unique_item_count_not_error_count(self):
+        """One item with multiple validation errors must count as 1 failed item."""
+        from app.core.auth import UserContext
+        from app.container import container
+        from app.api.v1.ground_truths import import_bulk
+
+        # Two invalid items (no history); each produces at least one approval-validation error
+        item1 = AgenticGroundTruthEntry(
+            id=str(uuid4()),
+            datasetName="ds",
+            bucket=str(uuid4()),
+            status=GroundTruthStatus.draft,
+            docType="ground-truth",
+            schemaVersion="agentic-v1",
+        )
+        item2 = AgenticGroundTruthEntry(
+            id=str(uuid4()),
+            datasetName="ds",
+            bucket=str(uuid4()),
+            status=GroundTruthStatus.draft,
+            docType="ground-truth",
+            schemaVersion="agentic-v1",
+        )
+        original_repo = container.repo
+        container.repo = AsyncMock()
+        container.repo.import_bulk_gt = AsyncMock(
+            return_value=BulkImportResult(imported=0, errors=[])
+        )
+        container.repo.list_gt_paginated = AsyncMock(return_value=([], None))
+
+        try:
+            result = await import_bulk(
+                items=[item1, item2],
+                user=UserContext(user_id="u1"),
+                buckets=1,
+                approve=True,
+            )
+            # Both items fail, so failed must equal 2 (one per unique item), not the raw error count
+            assert result.failed == 2
+            assert result.imported == 0
+        finally:
+            container.repo = original_repo
+
+    @pytest.mark.asyncio
+    async def test_approval_errors_carry_original_request_index(self):
+        """Approval errors must reference the original request index, not the filtered-list index."""
+        from app.core.auth import UserContext
+        from app.container import container
+        from app.api.v1.ground_truths import import_bulk
+        from app.domain.models import HistoryEntry
+
+        # First item is valid (passes tag validation and has history); second is approval-invalid
+        item_valid = AgenticGroundTruthEntry(
+            id=str(uuid4()),
+            datasetName="ds",
+            bucket=str(uuid4()),
+            status=GroundTruthStatus.draft,
+            docType="ground-truth",
+            schemaVersion="agentic-v1",
+            history=[
+                HistoryEntry(role="user", msg="Q"),
+                HistoryEntry(role="assistant", msg="A"),
+            ],
+        )
+        item_invalid = AgenticGroundTruthEntry(
+            id=str(uuid4()),
+            datasetName="ds",
+            bucket=str(uuid4()),
+            status=GroundTruthStatus.draft,
+            docType="ground-truth",
+            schemaVersion="agentic-v1",
+            # no history → approval validation fails
+        )
+        original_repo = container.repo
+        container.repo = AsyncMock()
+        container.repo.import_bulk_gt = AsyncMock(
+            return_value=BulkImportResult(imported=1, errors=[])
+        )
+        container.repo.list_gt_paginated = AsyncMock(return_value=([], None))
+
+        try:
+            # Request keeps the valid item first and the invalid item second.
+            result = await import_bulk(
+                items=[item_valid, item_invalid],
+                user=UserContext(user_id="u1"),
+                buckets=1,
+                approve=True,
+            )
+            assert result.imported == 1
+            assert result.failed == 1
+            # Error must reference original request index 1 (not 0 from filtered list)
+            approval_errors = [e for e in result.errors if e.code == "APPROVAL_VALIDATION_FAILED"]
+            assert approval_errors, "Expected APPROVAL_VALIDATION_FAILED errors"
+            assert all(e.index == 1 for e in approval_errors), (
+                f"Expected all error indices to be 1 (original request position), "
+                f"got: {[e.index for e in approval_errors]}"
+            )
+        finally:
+            container.repo = original_repo
+
+    @pytest.mark.asyncio
+    async def test_persistence_errors_recover_original_request_index_and_unique_failed_count(self):
+        """Persistence errors map back to request indices; each real item id counts as failed once."""
+        from app.core.auth import UserContext
+        from app.container import container
+        from app.api.v1.ground_truths import import_bulk
+
+        item0 = AgenticGroundTruthEntry(
+            id=str(uuid4()),
+            datasetName="ds",
+            bucket=str(uuid4()),
+            status=GroundTruthStatus.draft,
+            docType="ground-truth",
+            schemaVersion="agentic-v1",
+        )
+        item1 = AgenticGroundTruthEntry(
+            id=str(uuid4()),
+            datasetName="ds",
+            bucket=str(uuid4()),
+            status=GroundTruthStatus.draft,
+            docType="ground-truth",
+            schemaVersion="agentic-v1",
+        )
+        item2 = AgenticGroundTruthEntry(
+            id=str(uuid4()),
+            datasetName="ds",
+            bucket=str(uuid4()),
+            status=GroundTruthStatus.draft,
+            docType="ground-truth",
+            schemaVersion="agentic-v1",
+        )
+
+        original_repo = container.repo
+        container.repo = AsyncMock()
+        container.repo.import_bulk_gt = AsyncMock(
+            return_value=BulkImportResult(
+                imported=0,
+                errors=[
+                    f"exists (article: article-1, id: {item1.id})",
+                    f"create_failed (article: article-1, id: {item1.id}): boom",
+                    f"create_failed (article: article-2, id: {item2.id}): boom",
+                ],
+            )
+        )
+        container.repo.list_gt_paginated = AsyncMock(return_value=([], None))
+
+        try:
+            result = await import_bulk(
+                items=[item0, item1, item2],
+                user=UserContext(user_id="u1"),
+                buckets=1,
+                approve=False,
+            )
+
+            assert result.imported == 0
+            assert result.failed == 2
+            assert [error.index for error in result.errors] == [1, 1, 2]
+            assert [error.item_id for error in result.errors] == [item1.id, item1.id, item2.id]
+            assert [error.code for error in result.errors] == [
+                "DUPLICATE_ID",
+                "CREATE_FAILED",
+                "CREATE_FAILED",
+            ]
+        finally:
+            container.repo = original_repo
+
+    @pytest.mark.asyncio
+    async def test_persistence_errors_without_item_id_keep_safe_fallback(self):
+        """Persistence errors without an id should still fall back to index=-1 and item_id=None."""
+        from app.core.auth import UserContext
+        from app.container import container
+        from app.api.v1.ground_truths import import_bulk
+
+        item = AgenticGroundTruthEntry(
+            id=str(uuid4()),
+            datasetName="ds",
+            bucket=str(uuid4()),
+            status=GroundTruthStatus.draft,
+            docType="ground-truth",
+            schemaVersion="agentic-v1",
+        )
+
+        original_repo = container.repo
+        container.repo = AsyncMock()
+        container.repo.import_bulk_gt = AsyncMock(
+            return_value=BulkImportResult(
+                imported=0,
+                errors=["create_failed (article: article-1): boom"],
+            )
+        )
+        container.repo.list_gt_paginated = AsyncMock(return_value=([], None))
+
+        try:
+            result = await import_bulk(
+                items=[item],
+                user=UserContext(user_id="u1"),
+                buckets=1,
+                approve=False,
+            )
+
+            assert result.imported == 0
+            assert result.failed == 1
+            assert len(result.errors) == 1
+            assert result.errors[0].index == -1
+            assert result.errors[0].item_id is None
+            assert result.errors[0].code == "CREATE_FAILED"
+        finally:
+            container.repo = original_repo
+
+
+class TestDuplicateIdBulkImport:
+    """Step 1.8 — Duplicate IDs in a single bulk-import request must not collapse
+    per-request-entry error attribution or undercount failed request entries.
+
+    Covers the IQ-001 finding from the 2026-03-11 plan review.
+    """
+
+    @pytest.mark.asyncio
+    async def test_duplicate_id_approval_error_uses_correct_request_index(self):
+        """[invalid(id=X), valid(id=X)] approve=true → error index=0, failed=1, imported=1."""
+        from app.core.auth import UserContext
+        from app.container import container
+        from app.api.v1.ground_truths import import_bulk
+        from app.domain.models import HistoryEntry
+
+        shared_id = str(uuid4())
+
+        # Item at index 0: no history → fails approval validation
+        item_invalid = AgenticGroundTruthEntry(
+            id=shared_id,
+            datasetName="ds",
+            bucket=str(uuid4()),
+            status=GroundTruthStatus.draft,
+            docType="ground-truth",
+            schemaVersion="agentic-v1",
+            # no history → collect_approval_validation_errors will complain
+        )
+        # Item at index 1: has history → passes approval validation
+        item_valid = AgenticGroundTruthEntry(
+            id=shared_id,
+            datasetName="ds",
+            bucket=str(uuid4()),
+            status=GroundTruthStatus.draft,
+            docType="ground-truth",
+            schemaVersion="agentic-v1",
+            history=[
+                HistoryEntry(role="user", msg="Q"),
+                HistoryEntry(role="assistant", msg="A"),
+            ],
+        )
+
+        original_repo = container.repo
+        container.repo = AsyncMock()
+        container.repo.import_bulk_gt = AsyncMock(
+            return_value=BulkImportResult(imported=1, errors=[])
+        )
+        container.repo.list_gt_paginated = AsyncMock(return_value=([], None))
+
+        try:
+            result = await import_bulk(
+                items=[item_invalid, item_valid],
+                user=UserContext(user_id="u1"),
+                buckets=1,
+                approve=True,
+            )
+            assert result.imported == 1, f"Expected 1 imported, got {result.imported}"
+            assert result.failed == 1, f"Expected 1 failed, got {result.failed}"
+
+            approval_errors = [e for e in result.errors if e.code == "APPROVAL_VALIDATION_FAILED"]
+            assert approval_errors, "Expected APPROVAL_VALIDATION_FAILED errors"
+            assert all(e.index == 0 for e in approval_errors), (
+                f"Error must reference original request index 0 (the invalid entry), "
+                f"got indices: {[e.index for e in approval_errors]}"
+            )
+        finally:
+            container.repo = original_repo
+
+    @pytest.mark.asyncio
+    async def test_duplicate_id_both_fail_approval_counts_two_failed(self):
+        """[invalid(id=X), invalid(id=X)] approve=true → failed=2, errors at index 0 and 1."""
+        from app.core.auth import UserContext
+        from app.container import container
+        from app.api.v1.ground_truths import import_bulk
+
+        shared_id = str(uuid4())
+
+        item0 = AgenticGroundTruthEntry(
+            id=shared_id,
+            datasetName="ds",
+            bucket=str(uuid4()),
+            status=GroundTruthStatus.draft,
+            docType="ground-truth",
+            schemaVersion="agentic-v1",
+            # no history
+        )
+        item1 = AgenticGroundTruthEntry(
+            id=shared_id,
+            datasetName="ds",
+            bucket=str(uuid4()),
+            status=GroundTruthStatus.draft,
+            docType="ground-truth",
+            schemaVersion="agentic-v1",
+            # no history
+        )
+
+        original_repo = container.repo
+        container.repo = AsyncMock()
+        container.repo.import_bulk_gt = AsyncMock(
+            return_value=BulkImportResult(imported=0, errors=[])
+        )
+        container.repo.list_gt_paginated = AsyncMock(return_value=([], None))
+
+        try:
+            result = await import_bulk(
+                items=[item0, item1],
+                user=UserContext(user_id="u1"),
+                buckets=1,
+                approve=True,
+            )
+            assert result.imported == 0
+            assert result.failed == 2, (
+                f"Both request entries must be counted as failed, got {result.failed}"
+            )
+            approval_errors = [e for e in result.errors if e.code == "APPROVAL_VALIDATION_FAILED"]
+            error_indices = sorted(e.index for e in approval_errors)
+            assert 0 in error_indices, "Expected an error at index 0"
+            assert 1 in error_indices, "Expected an error at index 1"
+        finally:
+            container.repo = original_repo
+
+    @pytest.mark.asyncio
+    async def test_duplicate_id_persistence_collision_uses_later_request_index(self):
+        """[valid(id=X), valid(id=X)] repo duplicate on second item → error index=1."""
+        from app.core.auth import UserContext
+        from app.container import container
+        from app.api.v1.ground_truths import import_bulk
+        from app.domain.models import HistoryEntry
+
+        shared_id = str(uuid4())
+
+        item0 = AgenticGroundTruthEntry(
+            id=shared_id,
+            datasetName="ds",
+            bucket=str(uuid4()),
+            status=GroundTruthStatus.draft,
+            docType="ground-truth",
+            schemaVersion="agentic-v1",
+            history=[
+                HistoryEntry(role="user", msg="Q0"),
+                HistoryEntry(role="assistant", msg="A0"),
+            ],
+        )
+        item1 = AgenticGroundTruthEntry(
+            id=shared_id,
+            datasetName="ds",
+            bucket=str(uuid4()),
+            status=GroundTruthStatus.draft,
+            docType="ground-truth",
+            schemaVersion="agentic-v1",
+            history=[
+                HistoryEntry(role="user", msg="Q1"),
+                HistoryEntry(role="assistant", msg="A1"),
+            ],
+        )
+
+        original_repo = container.repo
+        container.repo = AsyncMock()
+        container.repo.import_bulk_gt = AsyncMock(
+            return_value=BulkImportResult(
+                imported=1,
+                errors=[f"exists (article: article-1, id: {shared_id})"],
+                persistence_errors=[
+                    BulkImportPersistenceError(
+                        message=f"exists (article: article-1, id: {shared_id})",
+                        item_id=shared_id,
+                        persistence_index=1,
+                    )
+                ],
+            )
+        )
+        container.repo.list_gt_paginated = AsyncMock(return_value=([], None))
+
+        try:
+            result = await import_bulk(
+                items=[item0, item1],
+                user=UserContext(user_id="u1"),
+                buckets=1,
+                approve=False,
+            )
+
+            assert result.imported == 1
+            assert result.failed == 1
+            assert len(result.errors) == 1
+            assert result.errors[0].code == "DUPLICATE_ID"
+            assert result.errors[0].item_id == shared_id
+            assert result.errors[0].index == 1
+        finally:
+            container.repo = original_repo
diff --git a/backend/tests/unit/test_pii_detection.py b/backend/tests/unit/test_pii_detection.py
index 586a02c..58673b8 100644
--- a/backend/tests/unit/test_pii_detection.py
+++ b/backend/tests/unit/test_pii_detection.py
@@ -20,8 +20,9 @@
     _mask_match,
     _create_snippet,
 )
-from app.domain.models import GroundTruthItem, HistoryItem
+from app.domain.models import HistoryItem
 from app.domain.enums import HistoryItemRole
+from tests.test_helpers import make_test_entry
 
 
 class TestEmailDetection:
@@ -165,44 +166,51 @@ class TestGroundTruthItemScanning:
 
     def test_scans_synth_question(self):
         """Should detect PII in synthQuestion field."""
-        item = GroundTruthItem(
+        item = make_test_entry(
             id="test-1",
-            datasetName="test-dataset",
+            dataset_name="test-dataset",
             synth_question="Contact alice@example.com for help",
         )
         warnings = scan_item_for_pii(item)
-        assert len(warnings) == 1
-        assert warnings[0].field == "synthQuestion"
+        # Should find PII in multiple representations (history, plugin data, computed fields)
+        assert len(warnings) >= 1
+        # Check that at least one warning is for synthQuestion
+        assert any(w.field == "synthQuestion" for w in warnings)
+        assert any("email" in w.pattern_type for w in warnings)
 
     def test_scans_edited_question(self):
         """Should detect PII in editedQuestion field."""
-        item = GroundTruthItem(
+        item = make_test_entry(
             id="test-1",
-            datasetName="test-dataset",
+            dataset_name="test-dataset",
             synth_question="Original question",
             edited_question="Contact support@company.org for assistance",
         )
         warnings = scan_item_for_pii(item)
-        assert len(warnings) == 1
-        assert warnings[0].field == "editedQuestion"
+        # Should find PII in multiple representations
+        assert len(warnings) >= 1
+        assert any(w.field == "editedQuestion" for w in warnings)
+        assert any("email" in w.pattern_type for w in warnings)
 
     def test_scans_answer(self):
         """Should detect PII in answer field."""
-        item = GroundTruthItem(
+        item = make_test_entry(
             id="test-1",
-            datasetName="test-dataset",
+            dataset_name="test-dataset",
             synth_question="What is the contact?",
             answer="Call us at (555) 123-4567",
         )
         warnings = scan_item_for_pii(item)
-        assert len(warnings) == 1
-        assert warnings[0].field == "answer"
+        # Should find PII in multiple representations
+        assert len(warnings) >= 1
+        assert any(w.field == "answer" for w in warnings)
+        assert any("phone" in w.pattern_type for w in warnings)
 
     def test_scans_comment(self):
         """Should detect PII in comment field."""
-        item = GroundTruthItem(
+        item = make_test_entry(
             id="test-1",
-            datasetName="test-dataset",
+            dataset_name="test-dataset",
             synth_question="A question",
             comment="Reviewed by john@internal.com",
         )
@@ -212,9 +220,9 @@ def test_scans_comment(self):
 
     def test_scans_history_messages(self):
         """Should detect PII in history messages."""
-        item = GroundTruthItem(
+        item = make_test_entry(
             id="test-1",
-            datasetName="test-dataset",
+            dataset_name="test-dataset",
             synth_question="A question",
             history=[
                 HistoryItem(role=HistoryItemRole.user, msg="Contact alice@example.com"),
@@ -231,15 +239,51 @@ def test_scans_history_messages(self):
 
     def test_returns_empty_for_clean_item(self):
         """Should return empty list for item without PII."""
-        item = GroundTruthItem(
+        item = make_test_entry(
             id="test-1",
-            datasetName="test-dataset",
+            dataset_name="test-dataset",
             synth_question="What is the weather like today?",
             answer="The weather is sunny and warm.",
         )
         warnings = scan_item_for_pii(item)
         assert len(warnings) == 0
 
+    def test_scans_generic_context_entries_and_trace_payload(self):
+        item = make_test_entry(
+            id="test-1",
+            dataset_name="test-dataset",
+            contextEntries=[{"key": "customerEmail", "value": "alice@example.com"}],
+            tracePayload={"notes": "Call me at 555-123-4567"},
+        )
+
+        warnings = scan_item_for_pii(item)
+        fields = {warning.field for warning in warnings}
+        assert "contextEntries[0].value" in fields
+        assert "tracePayload.notes" in fields
+
+    def test_scans_generic_tool_calls_and_plugins(self):
+        item = make_test_entry(
+            id="test-1",
+            dataset_name="test-dataset",
+            toolCalls=[
+                {
+                    "name": "lookup_customer",
+                    "response": {"email": "agent@example.com"},
+                }
+            ],
+            plugins={
+                "rag-compat": {
+                    "kind": "rag-compat",
+                    "data": {"phone": "(555) 123-4567"},
+                }
+            },
+        )
+
+        warnings = scan_item_for_pii(item)
+        fields = {warning.field for warning in warnings}
+        assert "toolCalls[0].response.email" in fields
+        assert "plugins.rag-compat.data.phone" in fields
+
 
 class TestBulkScanning:
     """Tests for bulk item scanning."""
@@ -247,27 +291,30 @@ class TestBulkScanning:
     def test_scans_multiple_items(self):
         """Should scan all items and aggregate warnings."""
         items = [
-            GroundTruthItem(
+            make_test_entry(
                 id="item-1",
-                datasetName="test-dataset",
+                dataset_name="test-dataset",
                 synth_question="Contact alice@example.com",
             ),
-            GroundTruthItem(
+            make_test_entry(
                 id="item-2",
-                datasetName="test-dataset",
+                dataset_name="test-dataset",
                 synth_question="No PII here",
             ),
-            GroundTruthItem(
+            make_test_entry(
                 id="item-3",
-                datasetName="test-dataset",
+                dataset_name="test-dataset",
                 synth_question="Call (555) 123-4567",
             ),
         ]
         warnings = scan_bulk_items_for_pii(items)
-        assert len(warnings) == 2
+        # Should find PII in item-1 and item-3 (multiple representations each)
+        assert len(warnings) >= 2
         item_ids = {w.item_id for w in warnings}
         assert "item-1" in item_ids
         assert "item-3" in item_ids
+        # item-2 should not have any warnings
+        assert "item-2" not in item_ids
 
     def test_handles_empty_list(self):
         """Should return empty list for empty input."""
@@ -288,16 +335,17 @@ def test_handles_empty_text(self):
 
     def test_handles_item_without_id(self):
         """Should handle item with missing/blank ID."""
-        item = GroundTruthItem(
+        item = make_test_entry(
             id="",
-            datasetName="test-dataset",
+            dataset_name="test-dataset",
             synth_question="Contact user@example.com",
         )
         # Need to set id after construction since blank is usually generated
         item.id = ""
         warnings = scan_item_for_pii(item)
-        assert len(warnings) == 1
-        assert warnings[0].item_id == "(no ID)"
+        # Should find PII with "(no ID)" as the item_id
+        assert len(warnings) >= 1
+        assert all(w.item_id == "(no ID)" for w in warnings)
 
     def test_mixed_pii_types_in_single_field(self):
         """Should detect multiple PII types in the same field."""
diff --git a/backend/tests/unit/test_plugin_pack_extension.py b/backend/tests/unit/test_plugin_pack_extension.py
new file mode 100644
index 0000000..c4d1784
--- /dev/null
+++ b/backend/tests/unit/test_plugin_pack_extension.py
@@ -0,0 +1,208 @@
+"""Unit tests for runtime-backed PluginPack extension seams.
+
+These assertions stay because the extension hooks are wired into runtime flows
+for stats, explorer fields, and import/export registration. They are not legacy
+migration coverage.
+"""
+
+from __future__ import annotations
+
+from typing import Any
+
+from app.plugins.base import (
+    ExplorerFieldDefinition,
+    ExportTransform,
+    ImportTransform,
+    PluginPack,
+    PluginPackRegistry,
+)
+
+
+# ---------------------------------------------------------------------------
+# Test double packs
+# ---------------------------------------------------------------------------
+
+
+class StatsPack(PluginPack):
+    @property
+    def name(self) -> str:
+        return "stats-pack"
+
+    def get_stats_contribution(self, base_stats: dict[str, Any]) -> dict[str, Any]:
+        return {
+            "stats-pack:customCount": 42,
+            "stats-pack:ratio": base_stats.get("total", 0) / 100,
+        }
+
+
+class ExplorerPack(PluginPack):
+    @property
+    def name(self) -> str:
+        return "explorer-pack"
+
+    def get_explorer_fields(self) -> list[ExplorerFieldDefinition]:
+        return [
+            ExplorerFieldDefinition(
+                key="explorer-pack:score",
+                label="Score",
+                field_type="number",
+                sortable=True,
+                filterable=True,
+            ),
+            ExplorerFieldDefinition(
+                key="explorer-pack:category",
+                label="Category",
+                field_type="string",
+                filterable=True,
+            ),
+        ]
+
+
+class TransformPack(PluginPack):
+    @property
+    def name(self) -> str:
+        return "transform-pack"
+
+    def get_import_transforms(self) -> list[ImportTransform]:
+        return [
+            ImportTransform(
+                name="transform-pack:normalize",
+                description="Normalize field casing",
+            )
+        ]
+
+    def get_export_transforms(self) -> list[ExportTransform]:
+        return [
+            ExportTransform(
+                name="transform-pack:flatten",
+                description="Flatten nested fields",
+            )
+        ]
+
+
+class NoOpExtensionPack(PluginPack):
+    @property
+    def name(self) -> str:
+        return "no-op-ext"
+
+
+# ---------------------------------------------------------------------------
+# Default no-op behavior
+# ---------------------------------------------------------------------------
+
+
+def test_default_stats_contribution_is_empty():
+    pack = NoOpExtensionPack()
+    assert pack.get_stats_contribution({"total": 100}) == {}
+
+
+def test_default_explorer_fields_is_empty():
+    pack = NoOpExtensionPack()
+    assert pack.get_explorer_fields() == []
+
+
+def test_default_import_transforms_is_empty():
+    pack = NoOpExtensionPack()
+    assert pack.get_import_transforms() == []
+
+
+def test_default_export_transforms_is_empty():
+    pack = NoOpExtensionPack()
+    assert pack.get_export_transforms() == []
+
+
+# ---------------------------------------------------------------------------
+# Stats aggregation
+# ---------------------------------------------------------------------------
+
+
+def test_collect_stats_returns_base_when_no_packs():
+    registry = PluginPackRegistry()
+    base = {"total": 100, "approved": 50}
+    result = registry.collect_stats(base)
+    assert result == base
+
+
+def test_collect_stats_merges_pack_contributions():
+    registry = PluginPackRegistry()
+    registry.register(StatsPack())
+    base = {"total": 100, "approved": 50}
+    result = registry.collect_stats(base)
+    assert result["total"] == 100
+    assert result["approved"] == 50
+    assert result["stats-pack:customCount"] == 42
+    assert result["stats-pack:ratio"] == 1.0
+
+
+def test_collect_stats_pack_key_overwrites_base_on_collision():
+    registry = PluginPackRegistry()
+
+    class OverwritePack(PluginPack):
+        @property
+        def name(self) -> str:
+            return "overwrite"
+
+        def get_stats_contribution(self, base_stats: dict[str, Any]) -> dict[str, Any]:
+            return {"total": 999}
+
+    registry.register(OverwritePack())
+    result = registry.collect_stats({"total": 100})
+    assert result["total"] == 999
+
+
+# ---------------------------------------------------------------------------
+# Explorer fields aggregation
+# ---------------------------------------------------------------------------
+
+
+def test_collect_explorer_fields_empty_registry():
+    registry = PluginPackRegistry()
+    assert registry.collect_explorer_fields() == []
+
+
+def test_collect_explorer_fields_populates_pack_name():
+    registry = PluginPackRegistry()
+    registry.register(ExplorerPack())
+    fields = registry.collect_explorer_fields()
+    assert len(fields) == 2
+    assert all(f.pack_name == "explorer-pack" for f in fields)
+    assert fields[0].key == "explorer-pack:score"
+    assert fields[1].key == "explorer-pack:category"
+
+
+def test_collect_explorer_fields_from_multiple_packs():
+    registry = PluginPackRegistry()
+    registry.register(ExplorerPack())
+    registry.register(NoOpExtensionPack())
+    fields = registry.collect_explorer_fields()
+    assert len(fields) == 2  # only ExplorerPack contributes
+
+
+# ---------------------------------------------------------------------------
+# Import/export transform aggregation
+# ---------------------------------------------------------------------------
+
+
+def test_collect_import_transforms_populates_pack_name():
+    registry = PluginPackRegistry()
+    registry.register(TransformPack())
+    transforms = registry.collect_import_transforms()
+    assert len(transforms) == 1
+    assert transforms[0].pack_name == "transform-pack"
+    assert transforms[0].name == "transform-pack:normalize"
+
+
+def test_collect_export_transforms_populates_pack_name():
+    registry = PluginPackRegistry()
+    registry.register(TransformPack())
+    transforms = registry.collect_export_transforms()
+    assert len(transforms) == 1
+    assert transforms[0].pack_name == "transform-pack"
+    assert transforms[0].name == "transform-pack:flatten"
+
+
+def test_collect_transforms_empty_when_no_contributing_packs():
+    registry = PluginPackRegistry()
+    registry.register(NoOpExtensionPack())
+    assert registry.collect_import_transforms() == []
+    assert registry.collect_export_transforms() == []
diff --git a/backend/tests/unit/test_plugin_pack_registry.py b/backend/tests/unit/test_plugin_pack_registry.py
new file mode 100644
index 0000000..4601488
--- /dev/null
+++ b/backend/tests/unit/test_plugin_pack_registry.py
@@ -0,0 +1,222 @@
+"""Unit tests for PluginPack ABC and PluginPackRegistry.
+
+Tests cover:
+- Successful pack registration
+- Duplicate-name rejection
+- Empty-name and whitespace-only-name rejection
+- validate_all() calls each pack's validate_registration()
+- validate_all() wraps errors with pack name context
+- collect_approval_errors() aggregates across all packs
+- get() and names() accessors
+"""
+
+from __future__ import annotations
+
+import pytest
+
+from app.domain.models import AgenticGroundTruthEntry
+from app.plugins.base import PluginPack, PluginPackRegistry
+
+
+# ---------------------------------------------------------------------------
+# Test double packs
+# ---------------------------------------------------------------------------
+
+
+class NoOpPack(PluginPack):
+    @property
+    def name(self) -> str:
+        return "no-op"
+
+
+class AlwaysErrorPack(PluginPack):
+    """A pack whose validate_registration always fails."""
+
+    @property
+    def name(self) -> str:
+        return "always-error"
+
+    def validate_registration(self) -> None:
+        raise ValueError("intentional startup failure for testing")
+
+
+class ApprovalErrorPack(PluginPack):
+    """A pack that always appends an approval error."""
+
+    @property
+    def name(self) -> str:
+        return "approval-error"
+
+    def collect_approval_errors(self, item: AgenticGroundTruthEntry) -> list[str]:
+        return [f"approval-error-pack: item {item.id} failed"]
+
+
+class ConditionalApprovalPack(PluginPack):
+    """A pack that errors on items with a specific dataset name."""
+
+    @property
+    def name(self) -> str:
+        return "conditional"
+
+    def collect_approval_errors(self, item: AgenticGroundTruthEntry) -> list[str]:
+        if item.datasetName == "forbidden":
+            return ["conditional: forbidden dataset cannot be approved"]
+        return []
+
+
+# ---------------------------------------------------------------------------
+# Registration tests
+# ---------------------------------------------------------------------------
+
+
+def test_register_single_pack_succeeds():
+    registry = PluginPackRegistry()
+    registry.register(NoOpPack())
+    assert len(registry) == 1
+
+
+def test_register_multiple_packs_succeeds():
+    registry = PluginPackRegistry()
+    registry.register(NoOpPack())
+    registry.register(ApprovalErrorPack())
+    assert len(registry) == 2
+
+
+def test_register_duplicate_name_raises():
+    registry = PluginPackRegistry()
+    registry.register(NoOpPack())
+    with pytest.raises(ValueError, match="Duplicate plugin pack name 'no-op'"):
+        registry.register(NoOpPack())
+
+
+def test_register_empty_name_raises():
+    class EmptyNamePack(PluginPack):
+        @property
+        def name(self) -> str:
+            return ""
+
+    registry = PluginPackRegistry()
+    with pytest.raises(ValueError, match="non-empty string"):
+        registry.register(EmptyNamePack())
+
+
+def test_register_whitespace_name_raises():
+    class WhitespacePack(PluginPack):
+        @property
+        def name(self) -> str:
+            return "   "
+
+    registry = PluginPackRegistry()
+    with pytest.raises(ValueError, match="non-empty string"):
+        registry.register(WhitespacePack())
+
+
+# ---------------------------------------------------------------------------
+# validate_all tests
+# ---------------------------------------------------------------------------
+
+
+def test_validate_all_passes_for_valid_packs():
+    registry = PluginPackRegistry()
+    registry.register(NoOpPack())
+    registry.validate_all()  # should not raise
+
+
+def test_validate_all_fails_with_pack_name_in_message():
+    registry = PluginPackRegistry()
+    registry.register(AlwaysErrorPack())
+    with pytest.raises(ValueError, match="Plugin pack 'always-error' failed startup validation"):
+        registry.validate_all()
+
+
+def test_validate_all_includes_original_error_message():
+    registry = PluginPackRegistry()
+    registry.register(AlwaysErrorPack())
+    with pytest.raises(ValueError, match="intentional startup failure for testing"):
+        registry.validate_all()
+
+
+def test_validate_all_empty_registry_passes():
+    registry = PluginPackRegistry()
+    registry.validate_all()  # no packs — should not raise
+
+
+# ---------------------------------------------------------------------------
+# collect_approval_errors tests
+# ---------------------------------------------------------------------------
+
+
+def _make_item(item_id: str = "t-001", dataset: str = "demo") -> AgenticGroundTruthEntry:
+    return AgenticGroundTruthEntry(
+        id=item_id,
+        datasetName=dataset,
+        history=[
+            {"role": "user", "msg": "hello"},
+            {"role": "assistant", "msg": "world"},
+        ],
+    )
+
+
+def test_collect_approval_errors_empty_registry():
+    registry = PluginPackRegistry()
+    item = _make_item()
+    assert registry.collect_approval_errors(item) == []
+
+
+def test_collect_approval_errors_no_op_pack_returns_empty():
+    registry = PluginPackRegistry()
+    registry.register(NoOpPack())
+    item = _make_item()
+    assert registry.collect_approval_errors(item) == []
+
+
+def test_collect_approval_errors_error_pack_returns_errors():
+    registry = PluginPackRegistry()
+    registry.register(ApprovalErrorPack())
+    item = _make_item(item_id="x-1")
+    errors = registry.collect_approval_errors(item)
+    assert len(errors) == 1
+    assert "x-1" in errors[0]
+
+
+def test_collect_approval_errors_aggregates_across_packs():
+    registry = PluginPackRegistry()
+    registry.register(ApprovalErrorPack())
+    registry.register(ConditionalApprovalPack())
+
+    # Item where ConditionalApprovalPack also fires
+    item = _make_item(dataset="forbidden")
+    errors = registry.collect_approval_errors(item)
+    assert len(errors) == 2  # one from each pack
+
+
+def test_collect_approval_errors_conditional_pack_silent_for_allowed_dataset():
+    registry = PluginPackRegistry()
+    registry.register(ConditionalApprovalPack())
+    item = _make_item(dataset="allowed")
+    assert registry.collect_approval_errors(item) == []
+
+
+# ---------------------------------------------------------------------------
+# Accessor tests
+# ---------------------------------------------------------------------------
+
+
+def test_get_registered_pack_by_name():
+    registry = PluginPackRegistry()
+    pack = NoOpPack()
+    registry.register(pack)
+    assert registry.get("no-op") is pack
+
+
+def test_get_unregistered_pack_returns_none():
+    registry = PluginPackRegistry()
+    assert registry.get("missing") is None
+
+
+def test_names_returns_sorted_list():
+    registry = PluginPackRegistry()
+    registry.register(ConditionalApprovalPack())
+    registry.register(ApprovalErrorPack())
+    registry.register(NoOpPack())
+    assert registry.names() == ["approval-error", "conditional", "no-op"]
diff --git a/backend/tests/unit/test_rag_compat_approval.py b/backend/tests/unit/test_rag_compat_approval.py
new file mode 100644
index 0000000..951f505
--- /dev/null
+++ b/backend/tests/unit/test_rag_compat_approval.py
@@ -0,0 +1,168 @@
+"""Unit tests for RAG approval waiver migration.
+
+Validates that the RAG-specific assistant-message and required-tools waivers
+are owned by RagCompatPack.collect_approval_waivers() rather than being
+hard-coded in the core validation_service.
+"""
+
+from __future__ import annotations
+
+
+from app.domain.models import (
+    AgenticGroundTruthEntry,
+)
+from app.plugins.packs.rag_compat import RagCompatPack
+from app.services.validation_service import collect_approval_validation_errors
+
+
+# ---------------------------------------------------------------------------
+# Helpers
+# ---------------------------------------------------------------------------
+
+
+def _make_item(**overrides) -> AgenticGroundTruthEntry:
+    defaults = {
+        "id": "rag-test-1",
+        "datasetName": "demo",
+        "synthQuestion": "What is X?",
+    }
+    defaults.update(overrides)
+    return AgenticGroundTruthEntry.model_validate(defaults)
+
+
+# ---------------------------------------------------------------------------
+# Core validation (no plugin intervention) — strict after waiver removal
+# ---------------------------------------------------------------------------
+
+
+def test_core_requires_assistant_message_even_with_refs():
+    """After waiver removal, core always generates the assistant error."""
+    item = _make_item(
+        history=[{"role": "user", "msg": "hello"}],
+        totalReferences=5,
+    )
+    errors = collect_approval_validation_errors(item)
+    assert "history must include at least one assistant message" in errors
+
+
+def test_core_no_error_when_assistant_present():
+    item = _make_item(
+        history=[
+            {"role": "user", "msg": "hello"},
+            {"role": "assistant", "msg": "world"},
+        ],
+    )
+    errors = collect_approval_validation_errors(item)
+    assert errors == []
+
+
+# ---------------------------------------------------------------------------
+# RagCompatPack waiver — assistant-message
+# ---------------------------------------------------------------------------
+
+
+def test_rag_pack_waives_assistant_error_when_refs_present():
+    pack = RagCompatPack()
+    item = _make_item(
+        history=[{"role": "user", "msg": "hello"}],
+        totalReferences=3,
+    )
+    core_errors = collect_approval_validation_errors(item)
+    waivers = pack.collect_approval_waivers(item, core_errors)
+    assert "history must include at least one assistant message" in waivers
+
+
+def test_rag_pack_no_waiver_when_refs_zero():
+    pack = RagCompatPack()
+    item = _make_item(
+        history=[{"role": "user", "msg": "hello"}],
+        totalReferences=0,
+    )
+    core_errors = collect_approval_validation_errors(item)
+    waivers = pack.collect_approval_waivers(item, core_errors)
+    assert waivers == []
+
+
+def test_rag_pack_does_not_waive_user_message_error():
+    """The pack should only waive assistant-message and required-tools errors."""
+    pack = RagCompatPack()
+    item = _make_item(
+        history=[{"role": "assistant", "msg": "answer"}],
+        totalReferences=5,
+    )
+    core_errors = collect_approval_validation_errors(item)
+    waivers = pack.collect_approval_waivers(item, core_errors)
+    # "history must include at least one user message" should NOT be waived
+    assert "history must include at least one user message" not in waivers
+
+
+# ---------------------------------------------------------------------------
+# RagCompatPack waiver — required-tools
+# ---------------------------------------------------------------------------
+
+
+def test_rag_pack_waives_required_tools_error_when_refs_present():
+    pack = RagCompatPack()
+    item = _make_item(
+        history=[
+            {"role": "user", "msg": "hello"},
+            {"role": "assistant", "msg": "world"},
+        ],
+        toolCalls=[{"name": "search"}],
+        totalReferences=3,
+    )
+    core_errors = collect_approval_validation_errors(item)
+    waivers = pack.collect_approval_waivers(item, core_errors)
+    assert any("expectedTools.required" in w for w in waivers)
+
+
+def test_rag_pack_no_required_tools_waiver_when_refs_zero():
+    pack = RagCompatPack()
+    item = _make_item(
+        history=[
+            {"role": "user", "msg": "hello"},
+            {"role": "assistant", "msg": "world"},
+        ],
+        toolCalls=[{"name": "search"}],
+        totalReferences=0,
+    )
+    core_errors = collect_approval_validation_errors(item)
+    waivers = pack.collect_approval_waivers(item, core_errors)
+    assert waivers == []
+
+
+# ---------------------------------------------------------------------------
+# Registry-level waiver filtering (integration with PluginPackRegistry)
+# ---------------------------------------------------------------------------
+
+
+def test_registry_filters_waived_errors():
+    from app.plugins.base import PluginPackRegistry
+
+    registry = PluginPackRegistry()
+    registry.register(RagCompatPack())
+    registry.validate_all()
+
+    item = _make_item(
+        history=[{"role": "user", "msg": "hello"}],
+        totalReferences=3,
+    )
+    core_errors = collect_approval_validation_errors(item)
+    assert "history must include at least one assistant message" in core_errors
+
+    filtered = registry.filter_core_errors(item, core_errors)
+    assert "history must include at least one assistant message" not in filtered
+
+
+def test_registry_preserves_non_waived_errors():
+    from app.plugins.base import PluginPackRegistry
+
+    registry = PluginPackRegistry()
+    registry.register(RagCompatPack())
+
+    # Item with no history, no question, no answer → "no conversation message" error
+    item = _make_item(synthQuestion="", totalReferences=5)
+    core_errors = collect_approval_validation_errors(item)
+    filtered = registry.filter_core_errors(item, core_errors)
+    # "history must contain at least one conversation message" is NOT waived
+    assert any("at least one conversation message" in e for e in filtered)
diff --git a/backend/tests/unit/test_rag_compat_pack.py b/backend/tests/unit/test_rag_compat_pack.py
new file mode 100644
index 0000000..84259cc
--- /dev/null
+++ b/backend/tests/unit/test_rag_compat_pack.py
@@ -0,0 +1,316 @@
+"""Unit tests for RagCompatPack plugin contracts and migration helpers.
+
+Core-generic behavior stays covered elsewhere. This file focuses on:
+- runtime-backed pack registration and registry presence
+- stable helper contracts for retrieval/reference ownership
+- compat-migration helpers that still project legacy payloads while the shim exists
+"""
+
+from __future__ import annotations
+
+import pytest
+
+from app.domain.models import AgenticGroundTruthEntry, Reference
+from app.plugins.packs.rag_compat import RagCompatPack, _RAG_COMPAT_KIND
+from app.plugins.pack_registry import (
+    get_default_pack_registry,
+    reset_default_pack_registry,
+)
+
+
+# ---------------------------------------------------------------------------
+# validate_registration
+# ---------------------------------------------------------------------------
+
+
+def test_validate_registration_passes():
+    """RagCompatPack registers successfully when constants are in sync."""
+    pack = RagCompatPack()
+    pack.validate_registration()  # should not raise
+
+
+def test_validate_registration_name_matches_host_model_constant():
+    """The pack name must equal AgenticGroundTruthEntry._RAG_COMPAT_PLUGIN."""
+    from app.domain.models import AgenticGroundTruthEntry
+
+    pack = RagCompatPack()
+    assert pack.name == AgenticGroundTruthEntry._RAG_COMPAT_PLUGIN
+
+
+def test_validate_registration_kind_constant_correct():
+    """_RAG_COMPAT_KIND must match the host model constant."""
+    from app.domain.models import AgenticGroundTruthEntry
+
+    assert _RAG_COMPAT_KIND == AgenticGroundTruthEntry._RAG_COMPAT_PLUGIN
+
+
+def test_validate_registration_fails_on_constant_mismatch(monkeypatch: pytest.MonkeyPatch):
+    """validate_registration() must raise ValueError if constants diverge."""
+    pack = RagCompatPack()
+    # Simulate a rename of the host-model constant
+    monkeypatch.setattr(AgenticGroundTruthEntry, "_RAG_COMPAT_PLUGIN", "rag-v2")
+    with pytest.raises(ValueError, match="does not match"):
+        pack.validate_registration()
+
+
+# ---------------------------------------------------------------------------
+# Plugin-contract: approval hooks
+# ---------------------------------------------------------------------------
+
+
+def _generic_item() -> AgenticGroundTruthEntry:
+    return AgenticGroundTruthEntry(
+        id="gen-001",
+        datasetName="generic-dataset",
+        history=[
+            {"role": "user", "msg": "What is 2+2?"},
+            {"role": "assistant", "msg": "4"},
+        ],
+    )
+
+
+def _rag_item() -> AgenticGroundTruthEntry:
+    return AgenticGroundTruthEntry.model_validate(
+        {
+            "id": "rag-001",
+            "datasetName": "rag-dataset",
+            "synthQuestion": "What is retrieval?",
+            "answer": "Retrieval is finding relevant docs.",
+            "refs": [{"url": "https://example.com/doc"}],
+        }
+    )
+
+
+def test_collect_approval_errors_generic_item_empty():
+    pack = RagCompatPack()
+    item = _generic_item()
+    assert pack.collect_approval_errors(item) == []
+
+
+def test_collect_approval_errors_rag_item_empty():
+    """RAG items currently produce no additional pack-level errors."""
+    pack = RagCompatPack()
+    item = _rag_item()
+    assert pack.collect_approval_errors(item) == []
+
+
+# ---------------------------------------------------------------------------
+# Plugin-contract: helper accessors
+# ---------------------------------------------------------------------------
+
+
+def test_rag_compat_data_empty_for_generic_item():
+    pack = RagCompatPack()
+    item = _generic_item()
+    assert pack.rag_compat_data(item) == {}
+
+
+def test_rag_compat_data_populated_for_rag_item():
+    pack = RagCompatPack()
+    item = _rag_item()
+    data = pack.rag_compat_data(item)
+    # The model_validator moves synthQuestion, answer, refs into rag-compat plugin data
+    assert data  # non-empty
+
+
+def test_rag_compat_data_contains_synth_question():
+    pack = RagCompatPack()
+    item = _rag_item()
+    data = pack.rag_compat_data(item)
+    assert "synthQuestion" in data
+    assert data["synthQuestion"] == "What is retrieval?"
+
+
+# ---------------------------------------------------------------------------
+# refs_from_item accessor
+# ---------------------------------------------------------------------------
+
+
+def test_refs_from_item_empty_for_generic_item():
+    pack = RagCompatPack()
+    item = _generic_item()
+    assert pack.refs_from_item(item) == []
+
+
+def test_refs_from_item_populated_for_rag_item():
+    pack = RagCompatPack()
+    item = _rag_item()
+    refs = pack.refs_from_item(item)
+    assert len(refs) == 1
+    assert isinstance(refs[0], Reference)
+    assert refs[0].url == "https://example.com/doc"
+
+
+def test_refs_from_item_flattens_per_call_retrieval_state():
+    pack = RagCompatPack()
+    item = AgenticGroundTruthEntry.model_validate(
+        {
+            "id": "rag-002",
+            "datasetName": "rag-dataset",
+            "toolCalls": [{"id": "tc-1", "name": "search", "callType": "tool", "stepNumber": 2}],
+            "plugins": {
+                "rag-compat": {
+                    "kind": "rag-compat",
+                    "data": {
+                        "retrievals": {
+                            "tc-1": {
+                                "candidates": [
+                                    {
+                                        "url": "https://example.com/candidate",
+                                        "title": "Candidate",
+                                        "chunk": "retrieved chunk",
+                                    }
+                                ]
+                            }
+                        }
+                    },
+                }
+            },
+        }
+    )
+
+    refs = pack.refs_from_item(item)
+    assert len(refs) == 1
+    assert refs[0].url == "https://example.com/candidate"
+    assert refs[0].content == "retrieved chunk"
+    assert refs[0].messageIndex == 2
+
+
+# ---------------------------------------------------------------------------
+# Plugin-contract: reference ownership helpers
+# ---------------------------------------------------------------------------
+
+
+def test_attach_reference_adds_to_rag_item():
+    pack = RagCompatPack()
+    item = _rag_item()
+    initial_count = len(pack.refs_from_item(item))
+    new_ref = Reference(url="https://newdoc.example.com/page")
+    result = pack.attach_reference(item, new_ref)
+    assert result is item  # mutated in-place
+    assert len(pack.refs_from_item(item)) == initial_count + 1
+    urls = [r.url for r in pack.refs_from_item(item)]
+    assert "https://newdoc.example.com/page" in urls
+
+
+def test_attach_reference_works_on_generic_item():
+    pack = RagCompatPack()
+    item = _generic_item()
+    new_ref = Reference(url="https://docs.example.com/a")
+    pack.attach_reference(item, new_ref)
+    # The ref is written to rag-compat plugin payload via the setter
+    refs = pack.refs_from_item(item)
+    assert len(refs) == 1
+    assert refs[0].url == "https://docs.example.com/a"
+
+
+def test_detach_reference_removes_by_url():
+    pack = RagCompatPack()
+    item = _rag_item()
+    target_url = "https://example.com/doc"
+    assert any(r.url == target_url for r in pack.refs_from_item(item))
+
+    result = pack.detach_reference(item, target_url)
+    assert result is item
+    assert not any(r.url == target_url for r in pack.refs_from_item(item))
+
+
+def test_detach_reference_nonexistent_url_is_noop():
+    pack = RagCompatPack()
+    item = _rag_item()
+    before = len(pack.refs_from_item(item))
+    pack.detach_reference(item, "https://nonexistent.example.com")
+    assert len(pack.refs_from_item(item)) == before
+
+
+def test_replace_references_clears_per_call_retrieval_state():
+    pack = RagCompatPack()
+    item = AgenticGroundTruthEntry.model_validate(
+        {
+            "id": "rag-003",
+            "datasetName": "rag-dataset",
+            "plugins": {
+                "rag-compat": {
+                    "kind": "rag-compat",
+                    "data": {
+                        "retrievals": {"tc-1": {"candidates": [{"url": "https://example.com/old"}]}}
+                    },
+                }
+            },
+        }
+    )
+
+    pack.replace_references(item, [Reference(url="https://example.com/new")])
+
+    assert pack.has_per_call_state(item) is False
+    refs = pack.refs_from_item(item)
+    assert len(refs) == 1
+    assert refs[0].url == "https://example.com/new"
+
+
+def test_export_transform_projects_retrieval_candidates_to_refs():
+    pack = RagCompatPack()
+    transform = pack.get_export_transforms()[0].transform
+
+    projected = transform(
+        {
+            "id": "rag-004",
+            "datasetName": "rag-dataset",
+            "toolCalls": [{"id": "tc-1", "stepNumber": 1}],
+            "plugins": {
+                "rag-compat": {
+                    "kind": "rag-compat",
+                    "data": {
+                        "retrievals": {
+                            "tc-1": {
+                                "candidates": [
+                                    {
+                                        "url": "https://example.com/exported",
+                                        "title": "Exported",
+                                        "chunk": "retrieved chunk",
+                                    }
+                                ]
+                            }
+                        }
+                    },
+                }
+            },
+        }
+    )
+
+    assert projected["totalReferences"] == 1
+    assert projected["refs"][0]["url"] == "https://example.com/exported"
+    assert projected["refs"][0]["messageIndex"] == 1
+
+
+# ---------------------------------------------------------------------------
+# Runtime-backed registry seam
+# ---------------------------------------------------------------------------
+
+
+def test_default_pack_registry_contains_rag_compat():
+    reset_default_pack_registry()
+    try:
+        registry = get_default_pack_registry()
+        assert "rag-compat" in registry.names()
+    finally:
+        reset_default_pack_registry()
+
+
+def test_default_pack_registry_validates_without_error():
+    reset_default_pack_registry()
+    try:
+        registry = get_default_pack_registry()
+        registry.validate_all()  # should not raise
+    finally:
+        reset_default_pack_registry()
+
+
+def test_default_pack_registry_singleton_stable():
+    reset_default_pack_registry()
+    try:
+        r1 = get_default_pack_registry()
+        r2 = get_default_pack_registry()
+        assert r1 is r2
+    finally:
+        reset_default_pack_registry()
diff --git a/backend/tests/unit/test_retrieval_per_call.py b/backend/tests/unit/test_retrieval_per_call.py
new file mode 100644
index 0000000..b410d6e
--- /dev/null
+++ b/backend/tests/unit/test_retrieval_per_call.py
@@ -0,0 +1,183 @@
+"""Tests for RagCompatPack per-tool-call retrieval state (Phase 6)."""
+
+from __future__ import annotations
+
+
+from app.domain.models import AgenticGroundTruthEntry
+from app.plugins.packs.rag_compat import RagCompatPack
+
+
+def _make_item(**overrides) -> AgenticGroundTruthEntry:
+    """Create a minimal item with default fields."""
+    base = {
+        "id": "test-item",
+        "datasetName": "ds",
+        "history": [
+            {"role": "user", "msg": "hi"},
+            {"role": "assistant", "msg": "hello"},
+        ],
+    }
+    base.update(overrides)
+    return AgenticGroundTruthEntry.model_validate(base)
+
+
+def _make_item_with_refs(**overrides) -> AgenticGroundTruthEntry:
+    """Create an item with top-level refs (legacy pattern)."""
+    return _make_item(
+        refs=[
+            {"url": "https://a.com", "title": "A", "content": "chunk-a"},
+            {"url": "https://b.com", "title": "B", "content": "chunk-b"},
+        ],
+        **overrides,
+    )
+
+
+def _make_item_with_tool_calls(**overrides) -> AgenticGroundTruthEntry:
+    """Create an item with tool calls and top-level refs."""
+    return _make_item(
+        refs=[
+            {"url": "https://a.com", "title": "A", "content": "chunk-a", "messageIndex": 1},
+            {"url": "https://b.com", "title": "B", "content": "chunk-b"},
+        ],
+        toolCalls=[
+            {"id": "tc-1", "name": "search", "callType": "tool", "stepNumber": 1},
+            {"id": "tc-2", "name": "lookup", "callType": "tool", "stepNumber": 2},
+        ],
+        **overrides,
+    )
+
+
+class TestPerCallRetrievalState:
+    """Per-tool-call retrieval management on RagCompatPack."""
+
+    def test_get_retrievals_empty_item(self):
+        pack = RagCompatPack()
+        item = _make_item()
+        assert pack.get_retrievals(item) == {}
+
+    def test_set_and_get_retrieval_candidates(self):
+        pack = RagCompatPack()
+        item = _make_item()
+        candidates = [
+            {"url": "https://a.com", "title": "A", "chunk": "text-a"},
+        ]
+        pack.set_retrieval_candidates(item, "tc-1", candidates)
+        assert pack.get_retrieval_candidates(item, "tc-1") == candidates
+
+    def test_get_retrieval_candidates_missing_tool_call(self):
+        pack = RagCompatPack()
+        item = _make_item()
+        assert pack.get_retrieval_candidates(item, "nonexistent") == []
+
+    def test_set_retrievals_replaces_all(self):
+        pack = RagCompatPack()
+        item = _make_item()
+        pack.set_retrieval_candidates(item, "tc-1", [{"url": "https://a.com"}])
+        pack.set_retrievals(
+            item,
+            {
+                "tc-2": {"candidates": [{"url": "https://b.com"}]},
+            },
+        )
+        assert pack.get_retrieval_candidates(item, "tc-1") == []
+        assert len(pack.get_retrieval_candidates(item, "tc-2")) == 1
+
+    def test_has_per_call_state_false_when_empty(self):
+        pack = RagCompatPack()
+        item = _make_item()
+        assert pack.has_per_call_state(item) is False
+
+    def test_has_per_call_state_true_after_set(self):
+        pack = RagCompatPack()
+        item = _make_item()
+        pack.set_retrieval_candidates(item, "tc-1", [{"url": "https://a.com"}])
+        assert pack.has_per_call_state(item) is True
+
+    def test_get_all_candidates_flat_from_per_call(self):
+        pack = RagCompatPack()
+        item = _make_item()
+        pack.set_retrieval_candidates(
+            item,
+            "tc-1",
+            [
+                {"url": "https://a.com", "title": "A"},
+            ],
+        )
+        pack.set_retrieval_candidates(
+            item,
+            "tc-2",
+            [
+                {"url": "https://b.com", "title": "B"},
+            ],
+        )
+        flat = pack.get_all_candidates_flat(item)
+        assert len(flat) == 2
+        urls = {c["url"] for c in flat}
+        assert urls == {"https://a.com", "https://b.com"}
+
+    def test_get_all_candidates_flat_falls_back_to_top_level_refs(self):
+        pack = RagCompatPack()
+        item = _make_item_with_refs()
+        flat = pack.get_all_candidates_flat(item)
+        assert len(flat) == 2
+        assert flat[0]["url"] == "https://a.com"
+        assert flat[0]["chunk"] == "chunk-a"
+
+    def test_get_all_candidates_flat_includes_tool_call_id(self):
+        pack = RagCompatPack()
+        item = _make_item()
+        pack.set_retrieval_candidates(
+            item,
+            "tc-1",
+            [
+                {"url": "https://a.com"},
+            ],
+        )
+        flat = pack.get_all_candidates_flat(item)
+        assert flat[0]["toolCallId"] == "tc-1"
+
+
+class TestMigrateRefsToPerCall:
+    """Tests for migrate_refs_to_per_call helper."""
+
+    def test_migrate_no_refs_returns_false(self):
+        pack = RagCompatPack()
+        item = _make_item()
+        assert pack.migrate_refs_to_per_call(item) is False
+
+    def test_migrate_already_migrated_returns_false(self):
+        pack = RagCompatPack()
+        item = _make_item()
+        pack.set_retrieval_candidates(item, "tc-1", [{"url": "https://a.com"}])
+        # Per-call state already exists → skip migration
+        assert pack.migrate_refs_to_per_call(item) is False
+
+    def test_migrate_top_level_refs_to_unassociated(self):
+        pack = RagCompatPack()
+        item = _make_item_with_refs()
+        assert pack.migrate_refs_to_per_call(item) is True
+        # All refs go to _unassociated since there are no tool calls
+        cands = pack.get_retrieval_candidates(item, "_unassociated")
+        assert len(cands) == 2
+        assert cands[0]["url"] == "https://a.com"
+
+    def test_migrate_refs_matched_to_tool_calls_by_step(self):
+        pack = RagCompatPack()
+        item = _make_item_with_tool_calls()
+        assert pack.migrate_refs_to_per_call(item) is True
+
+        # Ref with messageIndex=1 matches tc-1 (stepNumber=1)
+        tc1_cands = pack.get_retrieval_candidates(item, "tc-1")
+        assert len(tc1_cands) == 1
+        assert tc1_cands[0]["url"] == "https://a.com"
+
+        # Ref without messageIndex goes to _unassociated
+        unassociated = pack.get_retrieval_candidates(item, "_unassociated")
+        assert len(unassociated) == 1
+        assert unassociated[0]["url"] == "https://b.com"
+
+    def test_migrate_idempotent(self):
+        pack = RagCompatPack()
+        item = _make_item_with_refs()
+        assert pack.migrate_refs_to_per_call(item) is True
+        assert pack.migrate_refs_to_per_call(item) is False
diff --git a/backend/tests/unit/test_search_raw_payload.py b/backend/tests/unit/test_search_raw_payload.py
new file mode 100644
index 0000000..6d00aea
--- /dev/null
+++ b/backend/tests/unit/test_search_raw_payload.py
@@ -0,0 +1,118 @@
+"""Unit tests for raw search payload preservation.
+
+Validates that SearchService.query() retains the complete provider response
+in the ``raw_payload`` field alongside the normalized url/title/chunk.
+"""
+
+from __future__ import annotations
+
+import pytest
+
+from app.services.search_service import SearchService
+
+
+# ---------------------------------------------------------------------------
+# Fake adapter
+# ---------------------------------------------------------------------------
+
+
+class FakeSearchAdapter:
+    """Returns canned results for testing."""
+
+    def __init__(self, results: list[dict]) -> None:
+        self._results = results
+
+    async def query(self, q: str, top: int = 5) -> list[dict]:
+        return self._results[:top]
+
+
+# ---------------------------------------------------------------------------
+# Tests
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.anyio
+async def test_raw_payload_included_in_results():
+    raw_hit = {
+        "url": "https://example.com/doc1",
+        "title": "Doc 1",
+        "chunk": "Some text",
+        "score": 0.95,
+        "metadata": {"source": "index-a"},
+    }
+    service = SearchService(adapter=FakeSearchAdapter([raw_hit]))
+    results = await service.query("test query")
+
+    assert len(results) == 1
+    assert results[0]["url"] == "https://example.com/doc1"
+    assert results[0]["title"] == "Doc 1"
+    assert results[0]["chunk"] == "Some text"
+    assert "raw_payload" in results[0]
+    assert results[0]["raw_payload"]["score"] == 0.95
+    assert results[0]["raw_payload"]["metadata"] == {"source": "index-a"}
+
+
+@pytest.mark.anyio
+async def test_raw_payload_contains_full_provider_response():
+    raw_hit = {
+        "url": "https://example.com/doc2",
+        "title": "Doc 2",
+        "chunk": "Content",
+        "extra_field_1": "value1",
+        "extra_field_2": [1, 2, 3],
+        "nested": {"deep": True},
+    }
+    service = SearchService(adapter=FakeSearchAdapter([raw_hit]))
+    results = await service.query("query")
+
+    payload = results[0]["raw_payload"]
+    assert payload["extra_field_1"] == "value1"
+    assert payload["extra_field_2"] == [1, 2, 3]
+    assert payload["nested"]["deep"] is True
+
+
+@pytest.mark.anyio
+async def test_raw_payload_is_independent_copy():
+    """Mutating raw_payload should not affect the normalized fields."""
+    raw_hit = {"url": "https://example.com", "title": "T", "chunk": "C"}
+    service = SearchService(adapter=FakeSearchAdapter([raw_hit]))
+    results = await service.query("q")
+
+    results[0]["raw_payload"]["url"] = "MODIFIED"
+    assert results[0]["url"] == "https://example.com"
+
+
+@pytest.mark.anyio
+async def test_empty_results_no_payload():
+    service = SearchService(adapter=FakeSearchAdapter([]))
+    results = await service.query("q")
+    assert results == []
+
+
+@pytest.mark.anyio
+async def test_no_adapter_returns_empty():
+    service = SearchService(adapter=None)
+    results = await service.query("q")
+    assert results == []
+
+
+@pytest.mark.anyio
+async def test_configurable_field_names_with_raw_payload():
+    """When field names are remapped, raw_payload still contains the original hit."""
+    raw_hit = {
+        "document_url": "https://example.com",
+        "heading": "Title",
+        "content": "Chunk text",
+        "relevance": 0.88,
+    }
+    service = SearchService(adapter=FakeSearchAdapter([raw_hit]))
+    service.url_field = "document_url"
+    service.title_field = "heading"
+    service.chunk_field = "content"
+
+    results = await service.query("q")
+    assert results[0]["url"] == "https://example.com"
+    assert results[0]["title"] == "Title"
+    assert results[0]["chunk"] == "Chunk text"
+    assert results[0]["raw_payload"]["relevance"] == 0.88
+    assert results[0]["raw_payload"]["document_url"] == "https://example.com"
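
Review note: the tests in `test_search_raw_payload.py` describe a contract in which each result carries normalized `url`/`title`/`chunk` fields plus an untouched copy of the provider hit. A minimal sketch of a normalizer that would meet that contract follows. The helper name `normalize_hit` and its keyword parameters are assumptions for illustration; the real `SearchService` may be structured differently.

```python
import copy
from typing import Any


def normalize_hit(
    hit: dict[str, Any],
    *,
    url_field: str = "url",
    title_field: str = "title",
    chunk_field: str = "chunk",
) -> dict[str, Any]:
    """Hypothetical helper: map one provider hit to url/title/chunk plus raw_payload."""
    return {
        "url": hit.get(url_field),
        "title": hit.get(title_field),
        "chunk": hit.get(chunk_field),
        # Deep copy so that mutating raw_payload (including nested values)
        # cannot change the provider's original hit.
        "raw_payload": copy.deepcopy(hit),
    }
```

A deep copy is used here rather than `dict(hit)` so nested values such as `metadata` are also isolated. Whether the real service copies deeply is not shown in this diff.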
diff --git a/backend/tests/unit/test_snapshot_service.py b/backend/tests/unit/test_snapshot_service.py
index ea83ec5..9a8cfcf 100644
--- a/backend/tests/unit/test_snapshot_service.py
+++ b/backend/tests/unit/test_snapshot_service.py
@@ -10,12 +10,12 @@
 from app.exports.registry import ExportFormatterRegistry, ExportProcessorRegistry
 from app.exports.storage.local import LocalExportStorage
 from app.services.snapshot_service import SnapshotService
-from app.domain.models import GroundTruthItem
+from app.domain.models import AgenticGroundTruthEntry
 from app.domain.enums import GroundTruthStatus
 
 
 class _FakeRepo:
-    def __init__(self, items: list[GroundTruthItem]):
+    def __init__(self, items: list[AgenticGroundTruthEntry]):
         self._items = items
         self.calls: list[tuple[str, Any]] = []
 
@@ -94,8 +94,8 @@ async def list_datasets(self, *args, **kwargs):  # pragma: no cover
         raise NotImplementedError
 
 
-def _make_item(id: str, dataset: str, status: GroundTruthStatus) -> GroundTruthItem:
-    return GroundTruthItem(
+def _make_item(id: str, dataset: str, status: GroundTruthStatus) -> AgenticGroundTruthEntry:
+    return AgenticGroundTruthEntry(
         id=id,
         datasetName=dataset,
         bucket=None,
@@ -131,6 +131,31 @@ def _build_snapshot_service(repo: _FakeRepo) -> SnapshotService:
     )
 
 
+def _build_snapshot_service_with_transforms(
+    repo: _FakeRepo, *, plugin_export_transforms: list[Any]
+) -> SnapshotService:
+    storage = LocalExportStorage(base_dir=".")
+    pipeline = ExportPipeline(storage)
+    processor_registry = ExportProcessorRegistry()
+    formatter_registry = ExportFormatterRegistry()
+    formatter_registry.register(JsonItemsFormatter())
+    formatter_registry.register_factory(
+        "json_snapshot_payload",
+        lambda snapshot_at, filters=None: JsonSnapshotPayloadFormatter(
+            snapshot_at=snapshot_at,
+            filters=filters,
+        ),
+    )
+    return SnapshotService(
+        repo,
+        export_pipeline=pipeline,
+        processor_registry=processor_registry,
+        formatter_registry=formatter_registry,
+        default_processor_order=[],
+        plugin_export_transforms=plugin_export_transforms,
+    )
+
+
 @pytest.mark.anyio
 async def test_collect_approved_calls_repo_with_status():
     items = [
@@ -206,3 +231,16 @@ async def test_build_snapshot_payload_empty_list():
 
     assert payload["count"] == 0
     assert payload["items"] == []
+
+
+@pytest.mark.anyio
+async def test_build_snapshot_payload_applies_plugin_export_transforms():
+    repo = _FakeRepo([_make_item("1", "faq", GroundTruthStatus.approved)])
+    svc = _build_snapshot_service_with_transforms(
+        repo,
+        plugin_export_transforms=[lambda doc: {**doc, "pluginProjected": True}],
+    )
+
+    payload = await svc.build_snapshot_payload()
+
+    assert payload["items"][0]["pluginProjected"] is True
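
Review note: the new snapshot test passes a list of callables as `plugin_export_transforms` and expects each exported document to reflect them. One plausible reading is that the transforms are applied in order to every document. The sketch below shows that reading; the function name `apply_export_transforms` is hypothetical and the actual `SnapshotService` wiring is not visible in this diff.

```python
from typing import Any, Callable, Iterable

ExportTransform = Callable[[dict[str, Any]], dict[str, Any]]


def apply_export_transforms(
    doc: dict[str, Any], transforms: Iterable[ExportTransform]
) -> dict[str, Any]:
    """Hypothetical sketch: run each transform in order, feeding the output forward."""
    for transform in transforms:
        doc = transform(doc)
    return doc
```

Because each transform receives the previous transform's output, ordering matters when two plugins touch the same key.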
diff --git a/backend/tests/unit/test_trace_export_adapter.py b/backend/tests/unit/test_trace_export_adapter.py
new file mode 100644
index 0000000..2f0f8ba
--- /dev/null
+++ b/backend/tests/unit/test_trace_export_adapter.py
@@ -0,0 +1,87 @@
+from __future__ import annotations
+
+from app.plugins.adapters.trace_export import TraceExportAdapter
+
+
+def test_trace_export_adapter_maps_trace_into_agentic_ground_truth() -> None:
+    payload = {
+        "trace_count": 1,
+        "traces": [
+            {
+                "id": "trace-123",
+                "cid_list": ["conversation-456"],
+                "uid": "user-789",
+                "impacted_device_type": "MSISDN",
+                "impacted_device": "[REDACTED_MSISDN]",
+                "metric_name": "user feedback",
+                "type": "like",
+                "comment": "",
+                "additional_feedback": {
+                    "The recommended resolution was correct and appropriate": 2,
+                },
+                "resolution": "CUSTOMER WAS ON CELLULAR DATA INSTEAD OF WIFI",
+                "feedback_date": 1771405033,
+                "feedback_datetime_utc": "2026-02-18T08:57:13+00:00",
+                "chat_history": [
+                    {
+                        "user_query": "CX IS USING TOO MUCH DATA AND WANTS TO KNOW WHY",
+                        "chat_response": "Analysis shows the account remained on cellular data.",
+                        "rca": "### Root Cause\nThe plan cap was exceeded after streaming on mobile data.",
+                        "context": [
+                            {
+                                "id": "tool-1",
+                                "run_id": "run-1",
+                                "function_name": "get_plan_usage",
+                                "function_arguments": "msisdn='[REDACTED_MSISDN]' context=None",
+                                "function_result": '{"response":{"items":[{"valueObject":{"planLimitGb":50,"usageGb":63}}]}}',
+                                "execution_time": 1.83,
+                            }
+                        ],
+                    }
+                ],
+            }
+        ],
+    }
+
+    adapter = TraceExportAdapter(dataset_name="customer-feedback")
+    [item] = adapter.adapt_payload(payload)
+
+    assert item.id == "trace-trace-123"
+    assert item.datasetName == "customer-feedback"
+    assert item.scenario_id == "trace-export:trace-123"
+    assert item.synth_question == "CX IS USING TOO MUCH DATA AND WANTS TO KNOW WHY"
+    assert item.answer is not None
+    assert "Root Cause" in item.answer
+    assert item.comment == "CUSTOMER WAS ON CELLULAR DATA INSTEAD OF WIFI"
+    assert item.trace_ids == {
+        "traceId": "trace-123",
+        "conversationId": "conversation-456",
+        "userId": "user-789",
+    }
+
+    assert len(item.context_entries) >= 1
+    assert item.metadata["sourceFormat"] == "trace-export"
+    assert item.metadata["toolCallCount"] == 1
+    assert item.trace_payload["resolution"] == "CUSTOMER WAS ON CELLULAR DATA INSTEAD OF WIFI"
+
+    [tool_call] = item.tool_calls
+    assert tool_call.name == "get_plan_usage"
+    assert tool_call.arguments == {
+        "msisdn": "[REDACTED_MSISDN]",
+        "context": None,
+    }
+    assert tool_call.response == {
+        "result": {
+            "response": {
+                "items": [{"valueObject": {"planLimitGb": 50, "usageGb": 63}}],
+            }
+        },
+        "executionTimeSeconds": 1.83,
+        "runId": "run-1",
+    }
+
+    assert [entry.source for entry in item.feedback] == [
+        "trace-export-summary",
+        "trace-export-ratings",
+    ]
+    assert item.feedback[1].values["The recommended resolution was correct and appropriate"] == 2
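
Review note: the adapter test expects a repr-style argument string (`msisdn='...' context=None`) to become a dict and a JSON `function_result` string to become a nested `response.result`. A minimal sketch of that mapping follows. The helper names `parse_repr_arguments` and `build_tool_call` and the regex-based parsing are assumptions for illustration, not the `TraceExportAdapter` internals.

```python
import ast
import json
import re
from typing import Any

# key=value pairs where the value is a quoted string or a bare token (None, 42, ...).
_PAIR = re.compile(r"""(\w+)=('(?:[^'\\]|\\.)*'|"(?:[^"\\]|\\.)*"|\S+)""")


def parse_repr_arguments(text: str) -> dict[str, Any]:
    """Hypothetical helper: parse "a='x' b=None" into {"a": "x", "b": None}."""
    parsed: dict[str, Any] = {}
    for key, raw in _PAIR.findall(text):
        try:
            parsed[key] = ast.literal_eval(raw)
        except (ValueError, SyntaxError):
            parsed[key] = raw  # keep unparseable tokens as plain strings
    return parsed


def build_tool_call(entry: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical helper: map one trace context entry to the shape the test asserts."""
    return {
        "name": entry["function_name"],
        "arguments": parse_repr_arguments(entry["function_arguments"]),
        "response": {
            "result": json.loads(entry["function_result"]),
            "executionTimeSeconds": entry["execution_time"],
            "runId": entry["run_id"],
        },
    }
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for untrusted trace text.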
diff --git a/backend/tests/unit/test_validation_required_tools.py b/backend/tests/unit/test_validation_required_tools.py
new file mode 100644
index 0000000..babde3e
--- /dev/null
+++ b/backend/tests/unit/test_validation_required_tools.py
@@ -0,0 +1,133 @@
+"""Unit tests for backend required-tool enforcement in approval validation.
+
+These tests were deferred from Phase 4 (frontend-only approval strictness)
+to Phase 5 where the backend side is implemented. They validate that
+collect_approval_validation_errors() enforces ≥1 required tool when tool
+calls are present, and that plugin-pack waivers can bypass this check.
+"""
+
+from __future__ import annotations
+
+from app.domain.models import (
+    AgenticGroundTruthEntry,
+)
+from app.plugins.base import PluginPackRegistry
+from app.plugins.packs.rag_compat import RagCompatPack
+from app.services.validation_service import collect_approval_validation_errors
+
+
+# ---------------------------------------------------------------------------
+# Helpers
+# ---------------------------------------------------------------------------
+
+REQUIRED_TOOLS_ERROR = (
+    "expectedTools.required must include at least one tool "
+    "before approval when toolCalls are present"
+)
+
+
+def _make_item(**overrides) -> AgenticGroundTruthEntry:
+    defaults = {
+        "id": "req-tool-1",
+        "datasetName": "demo",
+        "history": [
+            {"role": "user", "msg": "Find the answer."},
+            {"role": "assistant", "msg": "I found it."},
+        ],
+    }
+    defaults.update(overrides)
+    return AgenticGroundTruthEntry.model_validate(defaults)
+
+
+# ---------------------------------------------------------------------------
+# Core required-tool enforcement
+# ---------------------------------------------------------------------------
+
+
+def test_required_tool_error_when_tool_calls_exist_but_no_required():
+    item = _make_item(toolCalls=[{"name": "search"}])
+    errors = collect_approval_validation_errors(item)
+    assert REQUIRED_TOOLS_ERROR in errors
+
+
+def test_no_required_tool_error_when_required_tools_defined():
+    item = _make_item(
+        toolCalls=[{"name": "search"}],
+        expectedTools={"required": [{"name": "search"}]},
+    )
+    errors = collect_approval_validation_errors(item)
+    assert REQUIRED_TOOLS_ERROR not in errors
+
+
+def test_no_required_tool_error_when_no_tool_calls():
+    item = _make_item()
+    errors = collect_approval_validation_errors(item)
+    assert REQUIRED_TOOLS_ERROR not in errors
+
+
+def test_missing_required_tools_detected():
+    item = _make_item(
+        toolCalls=[{"name": "search"}],
+        expectedTools={"required": [{"name": "browser"}]},
+    )
+    errors = collect_approval_validation_errors(item)
+    assert any("browser" in e for e in errors)
+
+
+def test_multiple_missing_required_tools_sorted():
+    item = _make_item(
+        toolCalls=[{"name": "search"}],
+        expectedTools={"required": [{"name": "z-tool"}, {"name": "a-tool"}]},
+    )
+    errors = collect_approval_validation_errors(item)
+    missing_error = [e for e in errors if "do not exist" in e]
+    assert len(missing_error) == 1
+    assert "a-tool, z-tool" in missing_error[0]
+
+
+# ---------------------------------------------------------------------------
+# Plugin-pack waiver for required-tools
+# ---------------------------------------------------------------------------
+
+
+def test_rag_pack_waives_required_tools_for_retrieval_items():
+    """RagCompatPack waives the required-tools check for items with refs."""
+    registry = PluginPackRegistry()
+    registry.register(RagCompatPack())
+
+    item = _make_item(
+        toolCalls=[{"name": "search"}],
+        totalReferences=3,
+    )
+    core_errors = collect_approval_validation_errors(item)
+    assert REQUIRED_TOOLS_ERROR in core_errors
+
+    filtered = registry.filter_core_errors(item, core_errors)
+    assert REQUIRED_TOOLS_ERROR not in filtered
+
+
+def test_rag_pack_does_not_waive_required_tools_without_refs():
+    """Without refs, the required-tools error stands."""
+    registry = PluginPackRegistry()
+    registry.register(RagCompatPack())
+
+    item = _make_item(
+        toolCalls=[{"name": "search"}],
+        totalReferences=0,
+    )
+    core_errors = collect_approval_validation_errors(item)
+    filtered = registry.filter_core_errors(item, core_errors)
+    assert REQUIRED_TOOLS_ERROR in filtered
+
+
+def test_required_tools_pass_when_properly_classified():
+    """Items with properly classified tools produce no approval errors at all."""
+    item = _make_item(
+        toolCalls=[{"name": "search"}, {"name": "browser"}],
+        expectedTools={
+            "required": [{"name": "search"}],
+            "optional": [{"name": "browser"}],
+        },
+    )
+    errors = collect_approval_validation_errors(item)
+    assert not errors
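
Review note: the waiver tests show core errors being computed first and then filtered by registered packs, with the RAG pack dropping the required-tools error only when the item has references. The sketch below models that as a list of predicate waivers over plain dicts. The `Waiver` type, the function names, and the use of `totalReferences > 0` as the trigger are assumptions read off the tests, not the `PluginPackRegistry` API.

```python
from typing import Any, Callable

REQUIRED_TOOLS_ERROR = (
    "expectedTools.required must include at least one tool "
    "before approval when toolCalls are present"
)

# A waiver decides whether one core error should be dropped for a given item.
Waiver = Callable[[dict[str, Any], str], bool]


def waive_required_tools_when_refs(item: dict[str, Any], error: str) -> bool:
    """Hypothetical RAG-style waiver: only applies when the item has references."""
    return error == REQUIRED_TOOLS_ERROR and item.get("totalReferences", 0) > 0


def filter_core_errors(
    item: dict[str, Any], errors: list[str], waivers: list[Waiver]
) -> list[str]:
    """Keep every error that no registered waiver claims."""
    return [e for e in errors if not any(w(item, e) for w in waivers)]
```

Keeping waivers outside the core validator means the core error list stays the same regardless of which packs are registered, which matches how the tests assert on `core_errors` before filtering.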
diff --git a/backend/tests/unit/test_validation_service.py b/backend/tests/unit/test_validation_service.py
new file mode 100644
index 0000000..942eafd
--- /dev/null
+++ b/backend/tests/unit/test_validation_service.py
@@ -0,0 +1,55 @@
+from app.domain.models import (
+    AgenticGroundTruthEntry,
+    ExpectedTools,
+    ToolCallRecord,
+    ToolExpectation,
+)
+from app.services.validation_service import collect_approval_validation_errors
+
+
+def test_approval_validation_accepts_legacy_question_answer_payload():
+    item = AgenticGroundTruthEntry.model_validate(
+        {
+            "id": "item-1",
+            "datasetName": "demo",
+            "synthQuestion": "What is Ground Truth Curator?",
+            "answer": "It is a curation application.",
+        }
+    )
+
+    assert collect_approval_validation_errors(item) == []
+
+
+def test_approval_validation_requires_required_tool_when_tool_calls_exist():
+    item = AgenticGroundTruthEntry(
+        id="item-2",
+        datasetName="demo",
+        history=[
+            {"role": "user", "msg": "Find the answer."},
+            {"role": "assistant", "msg": "I found it."},
+        ],
+        toolCalls=[ToolCallRecord(name="search")],
+    )
+
+    errors = collect_approval_validation_errors(item)
+
+    assert errors == [
+        "expectedTools.required must include at least one tool before approval when toolCalls are present"
+    ]
+
+
+def test_approval_validation_requires_required_tool_to_match_tool_calls():
+    item = AgenticGroundTruthEntry(
+        id="item-3",
+        datasetName="demo",
+        history=[
+            {"role": "user", "msg": "Find the answer."},
+            {"role": "assistant", "msg": "I found it."},
+        ],
+        toolCalls=[ToolCallRecord(name="search")],
+        expectedTools=ExpectedTools(required=[ToolExpectation(name="browser")]),
+    )
+
+    errors = collect_approval_validation_errors(item)
+
+    assert errors == ["expectedTools.required references toolCalls that do not exist: browser"]
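
Review note: taken together, the two validation test files imply a small rule: no tool calls means no required-tools error, tool calls without any required tool yield one fixed message, and required tools that never appear in the tool calls are reported in sorted order. A minimal sketch of that rule over plain name lists follows. The function name `required_tool_errors` is hypothetical, and skipping all checks when there are no tool calls is an assumption; the real `collect_approval_validation_errors()` covers more than this.

```python
def required_tool_errors(tool_calls: list[str], required: list[str]) -> list[str]:
    """Hypothetical sketch of the required-tools approval rule."""
    if not tool_calls:
        # Assumption: items without tool calls are not subject to this rule.
        return []
    if not required:
        return [
            "expectedTools.required must include at least one tool "
            "before approval when toolCalls are present"
        ]
    # Required tools that never appear among the recorded tool calls.
    missing = sorted(set(required) - set(tool_calls))
    if missing:
        return [
            "expectedTools.required references toolCalls that do not exist: "
            + ", ".join(missing)
        ]
    return []
```

Sorting the missing names keeps the message deterministic, which is what lets the tests assert on the exact string `a-tool, z-tool`.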
diff --git a/docs/ARCHITECTURE.md b/docs/ARCHITECTURE.md
new file mode 100644
index 0000000..75fa773
--- /dev/null
+++ b/docs/ARCHITECTURE.md
@@ -0,0 +1,146 @@
+# Architecture
+
+## Purpose
+
+Ground Truth Curator is a monorepo for curating, reviewing, and exporting high-quality ground truth items. The backend owns HTTP APIs, orchestration, and persistence. The frontend owns curator workflows, typed API access, and the browser editing experience.
+
+## Entrypoints
+
+- **Backend app:** `backend/app/main.py`
+  - `create_app()` builds the FastAPI application.
+  - Mounts the versioned API under `/v1`.
+  - Exposes `/healthz`.
+  - Installs auth, request logging, and optional harness JSONL middleware.
+  - Optionally serves the built SPA when `GTC_FRONTEND_DIR` points at a frontend dist folder.
+- **Frontend app:** `frontend/src/main.tsx`
+  - Initializes frontend telemetry.
+  - Renders `App`, which currently renders `GTAppDemo`.
+  - Uses the Vite dev server and proxies `/v1` to the backend during local development.
+
+## Boundaries
+
+| Boundary | Input | Output | Owner |
+|---|---|---|---|
+| Browser shell | Browser route and user interaction | React component tree rooted at `src/main.tsx` | `frontend/src/main.tsx`, `frontend/src/App.tsx`, `frontend/src/demo.tsx` |
+| Frontend state and workflow orchestration | UI events, runtime config, provider responses | Editable `GroundTruthItem` state, save/approve/delete actions | `frontend/src/hooks/useGroundTruth.ts` |
+| Frontend API access | Typed method calls from hooks/providers | HTTP requests to `/v1` and mapped frontend models | `frontend/src/api/`, `frontend/src/services/`, `frontend/src/adapters/` |
+| HTTP API boundary | FastAPI request, headers, path params, JSON body | Status code + JSON response under `/v1` | `backend/app/api/v1/` |
+| Service orchestration | Typed request data from routers | Business workflow calls into repositories/adapters | `backend/app/services/` |
+| Persistence and external integrations | Service requests | Ground truth records, assignments, tags, search, inference side effects | `backend/app/adapters/` |
+| Domain contracts | External data normalized at edges | Typed backend models and validators | `backend/app/domain/` |
+| Plugin enrichment | Registry-driven tag or workflow extensions | Computed tags and plugin-owned behavior | `backend/app/plugins/` |
+| Local observability mirror | Completed request context | JSONL entries in `.harness/logs.jsonl` and `.harness/traces.jsonl` | `backend/app/core/harness_observability.py` |
+
+## Request And Data Flow
+
+1. **Frontend boot**
+   - Vite loads `frontend/src/main.tsx`.
+   - Frontend telemetry initializes before the React tree renders.
+   - `App` renders `GTAppDemo`.
+2. **Provider selection**
+   - `useGroundTruth()` chooses a provider.
+   - Demo flows use `JsonProvider` when `VITE_DEMO_MODE` / `DEMO_MODE` is active in supported builds.
+   - Normal flows use `ApiProvider`.
+3. **Frontend to backend**
+   - `ApiProvider` calls typed service helpers in `frontend/src/services/`.
+   - The Vite dev server proxies `/v1` to `HARNESS_BACKEND_URL` or `http://localhost:8000`.
+   - Runtime UI switches such as approval requirements and self-serve limits come from `/v1/config`, with environment fallback when the backend is unavailable.
+4. **HTTP parse and routing**
+   - `backend/app/main.py` includes `api_router` under `settings.API_PREFIX` (default `/v1`).
+   - Route modules in `backend/app/api/v1/` own request parsing, status codes, auth dependencies, and response shapes.
+5. **Service orchestration**
+   - Route modules call container-wired services in `backend/app/services/`.
+   - Services apply workflow rules such as assignment handling, snapshot export, tagging, search, and curation behavior.
+6. **Persistence and integrations**
+   - Services call repository or adapter implementations in `backend/app/adapters/`.
+   - The local default is memory-backed data for harness-friendly smoke runs.
+   - Production-oriented integrations include Cosmos DB, Azure AI Search, Blob-backed assets, and optional LLM/inference adapters when configured.
+7. **Return path**
+   - Backend responses serialize typed models in the wire schema expected by the frontend.
+   - `ApiProvider` maps API payloads into frontend models and preserves ETag metadata for retry-on-`412` flows.
+   - The React tree updates queue state, editor state, and stats views from provider results.
+8. **Observability**
+   - When `GTC_HARNESS_JSONL_ENABLED=true`, backend middleware writes one log record and one trace record per HTTP request to `.harness/`.
+   - Deployed environments may also emit Azure Monitor / OpenTelemetry telemetry, but `.harness/*.jsonl` is the local agent-facing contract.
+
+## Guardrails
+
+- Preserve the backend layering: `api/v1 -> services -> adapters`.
+- Do not import adapters directly from FastAPI route modules.
+- Keep backend domain models in `backend/app/domain/`; do not redefine backend data contracts in route handlers.
+- Keep frontend network calls in `frontend/src/api/` or `frontend/src/services/`, not in presentational components.
+- Keep provider-specific mapping logic in `frontend/src/adapters/`; components should consume normalized frontend models.
+- Regenerate frontend API types when backend schema changes: `cd frontend && npm run api:types`.
+- Respect ETag concurrency for update paths. Backend updates can return `412`; frontend retry logic belongs in the provider/service boundary, not in components.
+- Use `/v1/config` or centralized config helpers for runtime switches instead of scattering direct environment reads through feature code.
+- Do not modify `infra/` or deployment mechanics unless the workflow itself is changing and the task explicitly asks for it.
+
+## Backend And Frontend Ownership
+
+### Backend owns
+
+- HTTP routing, auth enforcement, OpenAPI generation, and request lifecycle middleware.
+- Assignment, curation, tagging, snapshot, search, and inference orchestration.
+- Persistence backends and Azure-facing adapters.
+- Local harness JSONL emission for request logs and traces.
+
+### Frontend owns
+
+- Queue, editor, evidence review, stats, and workflow interactions.
+- Provider selection between demo and real API flows.
+- Typed API consumption and ETag-aware save behavior.
+- Client-side telemetry initialization and UI error boundaries.
+
+## Data Shape Contracts
+
+- Parse and validate external HTTP input at the API boundary.
+- Normalize backend data into typed domain models before service-layer use.
+- Keep wire-shape transformations centralized:
+  - backend: route parsing + domain models
+  - frontend: `src/adapters/apiMapper.ts`, service helpers, and provider mapping
+- Preserve the backend camelCase wire contract expected by the generated frontend API types.
+
+## Enforcing Boundaries With Static Analysis
+
+Architecture docs help humans. Static analysis keeps the guardrails enforceable.
+
+### Existing quality gates
+
+- Backend lint: `cd backend && uv run ruff check app/`
+- Backend typecheck: `cd backend && uv run ty check app/`
+- Frontend lint: `cd frontend && npm run lint:check`
+- Frontend typecheck: `cd frontend && npm run typecheck`
+- Repo wrapper: `make -f Makefile.harness check`
+
+### Concrete next enforcement to add
+
+Add a backend import-boundary contract so the repo fails fast when cross-layer shortcuts appear.
+
+Recommended contract:
+
+- `app.domain` must not import from `app.api.v1`
+- `app.services` must not import from `app.api.v1`
+- `app.adapters` must not import from `app.api.v1`
+
+Example using `import-linter`:
+
+```toml
+[tool.importlinter]
+root_packages = ["app"]
+
+[[tool.importlinter.contracts]]
+name = "Backend layers must not depend on api.v1"
+type = "forbidden"
+source_modules = ["app.domain", "app.services", "app.adapters"]
+forbidden_modules = ["app.api.v1"]
+```
+
+Wire it into the existing harness after Ruff in `scripts/harness/lint.sh` so it runs in `make -f Makefile.harness check` and `make -f Makefile.harness ci`.
+
+## Change Checklist
+
+- [ ] Boundary ownership still matches `backend/app/` and `frontend/src/`
+- [ ] API schema changes are reflected in regenerated frontend types
+- [ ] New data transformations live at boundaries, not in components or route glue
+- [ ] New cross-layer imports are covered by lint or typecheck rules
+- [ ] Observability behavior changes are documented in `docs/OBSERVABILITY.md`
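
Review note: `docs/ARCHITECTURE.md` says update paths use ETag concurrency, can return `412`, and that retry logic belongs at the provider/service boundary. The sketch below illustrates that refetch-and-retry loop against an in-memory stand-in. It is written in Python for consistency with the rest of this diff even though the real retry lives in the frontend provider. `InMemoryStore`, `PreconditionFailed`, and `save_with_retry` are hypothetical names, not project code.

```python
import itertools
from dataclasses import dataclass, field
from typing import Any, Callable


class PreconditionFailed(Exception):
    """Stands in for an HTTP 412 response."""


@dataclass
class InMemoryStore:
    """Hypothetical single-document store with ETag checks on write."""

    doc: dict[str, Any]
    etag: str = "v1"
    _counter: Any = field(default_factory=lambda: itertools.count(2))

    def get(self) -> tuple[dict[str, Any], str]:
        return dict(self.doc), self.etag

    def put(self, doc: dict[str, Any], if_match: str) -> str:
        if if_match != self.etag:
            raise PreconditionFailed(f"stale etag {if_match!r}")
        self.doc = dict(doc)
        self.etag = f"v{next(self._counter)}"
        return self.etag


def save_with_retry(
    store: InMemoryStore,
    mutate: Callable[[dict[str, Any]], None],
    max_attempts: int = 3,
) -> str:
    """Refetch, reapply the edit, and retry when the write hits a stale ETag."""
    for _ in range(max_attempts):
        doc, etag = store.get()
        mutate(doc)
        try:
            return store.put(doc, if_match=etag)
        except PreconditionFailed:
            continue  # someone else wrote first; start over from fresh state
    raise PreconditionFailed(f"gave up after {max_attempts} attempts")
```

Reapplying the edit to freshly fetched state, rather than resending the stale document, is what keeps a competing writer's changes from being silently overwritten.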
diff --git a/docs/OBSERVABILITY.md b/docs/OBSERVABILITY.md
new file mode 100644
index 0000000..9d80334
--- /dev/null
+++ b/docs/OBSERVABILITY.md
@@ -0,0 +1,204 @@
+# Observability
+
+## Purpose
+
+Ground Truth Curator uses append-only JSONL files in `.harness/` as the local, agent-facing observability contract. The backend emits one structured log record and one structured trace record for each HTTP request when harness JSONL is enabled.
+
+## Files And Ownership
+
+```text
+.harness/
+  logs.jsonl    # request log lines
+  traces.jsonl  # request span records
+```
+
+- JSONL emission is implemented in `backend/app/core/harness_observability.py`.
+- Middleware installation happens in `backend/app/main.py` when `GTC_HARNESS_JSONL_ENABLED=true`.
+- `.harness/` is ephemeral local state and should not be committed.
+
+## Current Contract
+
+### `logs.jsonl`
+
+One JSON object per completed HTTP request.
+
+Example:
+
+```json
+{"ts":"2026-03-13T12:00:00+00:00","level":"INFO","msg":"GET /healthz 200","service":"gtc-api","trace_id":"3b7c...","span_id":"f83f1b02a9f54fbe","duration_ms":8,"status":"ok","method":"GET","path":"/healthz","http_status":200,"error":null}
+```
+
+### `traces.jsonl`
+
+One JSON object per completed HTTP request span.
+
+Example:
+
+```json
+{"trace_id":"3b7c...","span_id":"f83f1b02a9f54fbe","parent_id":null,"name":"GET /healthz","service":"gtc-api","start":"2026-03-13T12:00:00+00:00","end":"2026-03-13T12:00:00+00:00","duration_ms":8,"status":"ok","method":"GET","path":"/healthz","http_status":200}
+```
+
+## Required Event Fields
+
+### Required fields in `.harness/logs.jsonl`
+
+| Field | Required | Notes |
+|---|---|---|
+| `ts` | yes | ISO 8601 completion timestamp |
+| `level` | yes | `INFO`, `WARN`, or `ERROR` in the current implementation |
+| `msg` | yes | `<METHOD> <PATH> <STATUS>` |
+| `service` | yes | `settings.SERVICE_NAME` |
+| `trace_id` | yes | Correlates log and trace records |
+| `span_id` | yes | Per-request span id |
+| `duration_ms` | yes | Rounded request duration |
+| `status` | yes | `ok` for `<400`, otherwise `error` |
+| `method` | yes | HTTP verb |
+| `path` | yes | Request path |
+| `http_status` | yes | Numeric HTTP status |
+| `error` | yes | `null` on success, `"client error"` for 4xx, `"server error"` for 5xx, or exception class name for unhandled exceptions |
+
+### Required fields in `.harness/traces.jsonl`
+
+| Field | Required | Notes |
+|---|---|---|
+| `trace_id` | yes | Request correlation id |
+| `span_id` | yes | Per-request span id |
+| `parent_id` | yes | Currently always `null` |
+| `name` | yes | `<METHOD> <PATH>` |
+| `service` | yes | `settings.SERVICE_NAME` |
+| `start` | yes | ISO 8601 start timestamp |
+| `end` | yes | ISO 8601 completion timestamp |
+| `duration_ms` | yes | Rounded request duration |
+| `status` | yes | `ok` or `error` |
+| `method` | yes | HTTP verb |
+| `path` | yes | Request path |
+| `http_status` | yes | Numeric HTTP status |
+
+## Level Policy
+
+Ground Truth Curator's current HTTP logging policy comes directly from `backend/app/core/harness_observability.py`:
+
+| Condition | Level | Status | `error` field |
+|---|---|---|---|
+| HTTP `<400` | `INFO` | `ok` | `null` |
+| HTTP `400-499` | `WARN` | `error` | `"client error"` |
+| HTTP `>=500` | `ERROR` | `error` | `"server error"` |
+| Unhandled exception | `ERROR` | `error` | exception class name |
+
+Notes:
+
+- The current code treats all `<400` responses as `INFO`, so successful redirects stay out of the warning channel.
+- There is no separate slow-request warning emitter today. Slow requests are reviewed from `traces.jsonl` with `jq`.
+- Deployed environments may also emit Azure Monitor / OpenTelemetry telemetry, but `.harness/*.jsonl` is the local harness contract agents should inspect first.
+
+## Environment Toggles
+
+### Core harness switches
+
+- `GTC_HARNESS_JSONL_ENABLED=true`
+  - Installs the JSONL middleware and writes `.harness/logs.jsonl` and `.harness/traces.jsonl`.
+- `GTC_AZ_MONITOR_ENABLED=false`
+  - Used by the smoke flow to keep local verification self-contained and avoid Azure Monitor dependencies.
+
+### Backend behavior switches commonly used with the harness
+
+- `GTC_REPO_BACKEND=memory`
+  - Keeps local smoke and demo runs independent of Cosmos.
+- `GTC_DEMO_MODE=true`
+  - Enables demo data in memory-backed backend flows.
+- `GTC_DEMO_USER_ID=demo-user`
+  - Stable demo identity used by the backend.
+- `GTC_ENV_FILE=...`
+  - Layers explicit backend environment files when needed.
+
+### Frontend and local-dev switches
+
+- `VITE_DEMO_MODE=true`
+  - Makes `useGroundTruth()` choose the demo provider in supported builds.
+- `VITE_DEV_USER_ID=demo-user`
+  - Injects a dev `X-User-Id` through the frontend client.
+- `HARNESS_BACKEND_URL=http://localhost:8000`
+  - Vite proxy target for `/v1`.
+- `HARNESS_BACKEND_PORT`, `HARNESS_FRONTEND_PORT`
+  - Ports used by harness dev-up scripts.
+- `VITE_SELF_SERVE_LIMIT`, `VITE_REQUIRE_REFERENCE_VISIT`, `VITE_REQUIRE_KEY_PARAGRAPH`
+  - Runtime workflow toggles consumed from `/v1/config` with environment fallback.
+
+### Smoke-script overrides
+
+- `HARNESS_SMOKE_PORT`
+- `HARNESS_SMOKE_HEALTH_URL`
+- `HARNESS_SMOKE_URL`
+
+These allow the smoke probe to target non-default local ports or URLs without changing the script.
+
+## Smoke And Verify Commands
+
+### Smoke
+
+Run:
+
+```bash
+make -f Makefile.harness smoke
+```
+
+What it does today:
+
+1. Starts the backend with:
+
+   ```bash
+   GTC_AZ_MONITOR_ENABLED=false \
+   GTC_HARNESS_JSONL_ENABLED=true \
+   uv run uvicorn app.main:app --host 127.0.0.1 --port "$PORT"
+   ```
+
+2. Probes:
+   - `GET /healthz`
+   - `GET /v1/openapi.json`
+3. Builds the frontend with `cd frontend && npm run build`
+4. Fails if either `.harness/logs.jsonl` or `.harness/traces.jsonl` is missing or empty
+
+### Check, CI, verify, observe
+
+```bash
+make -f Makefile.harness check
+make -f Makefile.harness ci
+make -f Makefile.harness verify
+make -f Makefile.harness observe
+```
+
+- `check` runs repo lint and typecheck wrappers.
+- `ci` runs `smoke`, `check`, `api-check`, and `test`.
+- `verify` runs `ci` and then prints recent runtime errors and slow requests with `jq`.
+- `observe` gives a quick count of errors, slow traces, and line counts without a full test run.
+
+## jq Review Examples
+
+Use these from the repository root after a smoke, dev, or verify run.
+
+```bash
+# current verify target: recent runtime errors
+jq 'select(.level == "ERROR")' .harness/logs.jsonl | tail -5
+
+# include 4xx warnings as well
+jq 'select(.level == "WARN" or .level == "ERROR")' .harness/logs.jsonl | tail -20
+
+# current verify target: slow requests over 1s
+jq 'select(.duration_ms > 1000)' .harness/traces.jsonl | tail -5
+
+# current observe target: count runtime errors
+jq -s 'map(select(.level == "ERROR")) | length' .harness/logs.jsonl
+
+# current observe target: count traces slower than 500 ms
+jq -s 'map(select(.duration_ms > 500)) | length' .harness/traces.jsonl
+
+# inspect a single request across both files
+jq --arg tid "<trace_id>" 'select(.trace_id == $tid)' .harness/logs.jsonl .harness/traces.jsonl
+```
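+
+Because `jq` pretty-prints each object across several lines unless `-c` is passed, piping to `tail -5` trims output lines rather than records. When record-level results are needed, the same filters can be sketched in Python. This is an illustration only: it assumes one JSON object per line with the `level` and `duration_ms` fields used above, and the function names are hypothetical rather than part of the harness.
+
+```python
+import json
+
+
+def read_jsonl(path):
+    """Yield one parsed record per non-blank line of a JSONL file."""
+    with open(path, encoding="utf-8") as handle:
+        for line in handle:
+            if line.strip():
+                yield json.loads(line)
+
+
+def recent_by_level(path, levels=("ERROR",), limit=5):
+    """Return the last `limit` records whose level is in `levels`."""
+    matches = [r for r in read_jsonl(path) if r.get("level") in levels]
+    return matches[-limit:] if limit > 0 else []
+
+
+def slow_traces(path, threshold_ms=1000):
+    """Return trace records whose numeric duration_ms exceeds the threshold."""
+    return [
+        r for r in read_jsonl(path)
+        if isinstance(r.get("duration_ms"), (int, float))
+        and r["duration_ms"] > threshold_ms
+    ]
+```
+
+For example, `recent_by_level(".harness/logs.jsonl", levels=("WARN", "ERROR"), limit=20)` mirrors the second `jq` filter above but returns whole records.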
+
+## Review Expectations For Agents
+
+- After backend or frontend changes, prefer `make -f Makefile.harness verify`.
+- If `verify` is too heavy for the current task, run `make -f Makefile.harness smoke` and inspect `.harness/*.jsonl` directly.
+- Treat `WARN` records as user-visible workflow failures or contract mismatches worth reviewing, not just infrastructure noise.
+- Treat slow traces as performance regressions even when logs contain no `ERROR` records.
diff --git a/docs/design/frontend-runtime-configuration.md b/docs/design/frontend-runtime-configuration.md
index 1e4b8d5..ebea157 100644
--- a/docs/design/frontend-runtime-configuration.md
+++ b/docs/design/frontend-runtime-configuration.md
@@ -2,7 +2,7 @@
 
 ## Overview
 
-The Ground Truth Curator application supports configurable validation rules via **runtime configuration**. The frontend fetches configuration from the backend's `/v1/config` endpoint on startup, allowing validation rules to be changed without rebuilding the frontend.
+The Ground Truth Curator application supports runtime configuration for the generic host plus plugin-owned evidence workflows. The frontend fetches configuration from the backend's `/v1/config` endpoint on startup so environment-specific approval and evidence-review rules can change without rebuilding the frontend.
 
 ## Configuration Architecture
 
@@ -52,21 +52,23 @@ VITE_SELF_SERVE_LIMIT=10
 
 **Note:** In production, the backend configuration always takes precedence.
 
-## Validation Rules
+## Validation Rules and Evidence Surfaces
+
+Runtime config controls shared host behavior, but reference-specific rules apply only to workflows that still expose the compatibility search/evidence surface or to plugin surfaces that opt into the same gating contract. The generic multi-turn host remains conversation- and expected-tools-driven.
 
 ### Reference Visit Requirement
 
 **Environment Variable:** `VITE_REQUIRE_REFERENCE_VISIT` / `GTC_REQUIRE_REFERENCE_VISIT`  
 **Default:** `true`  
-**Applies to:** Both single-turn and multi-turn items
+**Applies to:** RAG-compat evidence workflows and any plugin surface that explicitly opts into visit gating
 
 When enabled (`true`):
-- All references must be opened/visited before approval
+- References governed by the active compat/plugin workflow must be opened/visited before approval
 - The "Needs visit" indicator appears for unvisited references
 - Approval is blocked until all references have `visitedAt` timestamp
 
 When disabled (`false`):
-- References can be approved without being visited
+- Plugin-owned evidence can be approved without visit gating when the active workflow does not require it
 - Visit status is tracked but not required for approval
 - Useful for bulk imports or when references are pre-validated
 
@@ -74,14 +76,14 @@ When disabled (`false`):
 
 **Environment Variable:** `VITE_REQUIRE_KEY_PARAGRAPH` / `GTC_REQUIRE_KEY_PARAGRAPH`  
 **Default:** `false`  
-**Applies to:** Both single-turn and multi-turn items
+**Applies to:** RAG-compat evidence workflows and any plugin surface that explicitly opts into key-paragraph gating
 
 When enabled (`true`):
-- Selected references must have key paragraphs ≥40 characters
+- Selected references in the active compat/plugin workflow must have key paragraphs ≥40 characters
 - Approval is blocked until all selected references have adequate key paragraphs
 
 When disabled (`false`):
-- Key paragraphs are optional but recommended
+- Key paragraphs are optional unless the active compat/plugin workflow requires them
 - Approval can proceed without key paragraphs
 - Useful for workflows where key paragraphs are added in a separate pass
 
@@ -186,24 +188,30 @@ Use case: Ensure references are reviewed but allow flexibility on key paragraphs
 
 ### Validation Logic
 
-The validation logic is implemented in:
-
-- `frontend/src/models/validators.ts` - `refsApprovalReady()` for single-turn
-- `frontend/src/models/gtHelpers.ts` - `canApproveMultiTurn()` for multi-turn
+The validation logic is split intentionally:
 
-Both functions use `getCachedConfig()` to determine whether to enforce:
+- `frontend/src/models/validators.ts` - conversation integrity and reference-compat helpers
+- `frontend/src/models/gtHelpers.ts` - generic approval plus plugin/compat bypass logic
+- `frontend/src/components/app/pages/ReferencesSection.tsx` - generic evidence/review host that decides whether the compatibility search surface is shown
 
-1. Reference visit requirement
-2. Key paragraph requirement for relevant references
+Runtime config only controls the reference-specific branch. Generic multi-turn approval remains conversation- and expected-tools-driven, while plugin-owned evidence panels decide whether to honor these shared reference rules.
 
 ### UI Components
 
-The following components have been updated to reflect configurable validation:
+The following components reflect configurable evidence validation:
 
+- `frontend/src/components/app/pages/ReferencesSection.tsx`
 - `frontend/src/components/app/ReferencesPanel/SelectedTab.tsx`
 - `frontend/src/components/app/editor/TurnReferencesModal.tsx`
 
-Help text and indicators now say "may be required based on configuration" rather than stating absolute requirements.
+Help text and indicators now say "may be required based on configuration" rather than stating absolute requirements, and the shared host only renders search when the current workflow still uses the compatibility surface.
+
+## Phase 1 Migration Inventory
+
+- Keep: runtime delivery of reference-specific configuration.
+- Rewrite: wording that implies every multi-turn workflow is permanently reference-first.
+- Narrow: compatibility-focused UI help text once plugin-owned evidence panels replace the shared references mental model.
+- Delete with shim: docs that only describe top-level reference approval after the legacy RAG path is retired.
 
 ## Local Development Setup
 
diff --git a/docs/development/testing.md b/docs/development/testing.md
index 5a50c29..1f422da 100644
--- a/docs/development/testing.md
+++ b/docs/development/testing.md
@@ -11,9 +11,15 @@ The backend uses pytest for unit and integration testing.
 ```bash
 cd backend
 
+# Run backend integration tests via the harness (run this one from the repo root, not from backend/)
+make -f Makefile.harness backend-integration-test
+
 # Run all unit tests
 uv run pytest tests/unit/ -v
 
+# Run all integration tests directly
+uv run pytest tests/integration/ -v
+
 # Run specific test file
 uv run pytest tests/unit/test_dos_prevention.py -v
 
diff --git a/docs/diagrams/gtc-extensibility-map.html b/docs/diagrams/gtc-extensibility-map.html
new file mode 100644
index 0000000..21dbff6
--- /dev/null
+++ b/docs/diagrams/gtc-extensibility-map.html
@@ -0,0 +1,1247 @@
+<!DOCTYPE html>
+<html lang="en">
+<head>
+<meta charset="UTF-8">
+<meta name="viewport" content="width=device-width, initial-scale=1.0">
+<title>Ground Truth Curator — Extensibility Map</title>
+<link rel="preconnect" href="https://fonts.googleapis.com">
+<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
+<link href="https://fonts.googleapis.com/css2?family=DM+Sans:wght@400;500;600;700&family=Fira+Code:wght@400;500;600&display=swap" rel="stylesheet">
+<style>
+  /* ============ THEME — Blueprint ============ */
+  :root {
+    --font-body: 'DM Sans', system-ui, sans-serif;
+    --font-mono: 'Fira Code', 'SF Mono', Consolas, monospace;
+
+    --bg: #f4f6f9;
+    --surface: #ffffff;
+    --surface2: #edf0f5;
+    --surface-elevated: #fbfcfe;
+    --surface-recessed: #e8ecf2;
+    --border: rgba(0, 0, 0, 0.07);
+    --border-bright: rgba(0, 0, 0, 0.15);
+    --text: #1a2332;
+    --text-dim: #6b7a8d;
+    --text-muted: #9aa8b8;
+
+    /* Teal + Slate + Gold palette */
+    --blue: #0369a1;
+    --blue-dim: rgba(3, 105, 161, 0.08);
+    --teal: #0891b2;
+    --teal-dim: rgba(8, 145, 178, 0.08);
+    --gold: #b45309;
+    --gold-dim: rgba(180, 83, 9, 0.08);
+    --sage: #4d7c0f;
+    --sage-dim: rgba(77, 124, 15, 0.08);
+    --rose: #be123c;
+    --rose-dim: rgba(190, 18, 60, 0.08);
+    --slate: #475569;
+    --slate-dim: rgba(71, 85, 105, 0.08);
+    --copper: #9a3412;
+    --copper-dim: rgba(154, 52, 18, 0.08);
+
+    /* Status */
+    --active: #059669;
+    --active-dim: rgba(5, 150, 105, 0.1);
+    --dormant: #d97706;
+    --dormant-dim: rgba(217, 119, 6, 0.1);
+    --unused: #9ca3af;
+    --unused-dim: rgba(156, 163, 175, 0.1);
+  }
+
+  @media (prefers-color-scheme: dark) {
+    :root {
+      --bg: #0f1620;
+      --surface: #182030;
+      --surface2: #1e2a3a;
+      --surface-elevated: #243348;
+      --surface-recessed: #141c28;
+      --border: rgba(255, 255, 255, 0.06);
+      --border-bright: rgba(255, 255, 255, 0.12);
+      --text: #e2e8f0;
+      --text-dim: #8899ad;
+      --text-muted: #5a6b7d;
+
+      --blue: #38bdf8;
+      --blue-dim: rgba(56, 189, 248, 0.12);
+      --teal: #22d3ee;
+      --teal-dim: rgba(34, 211, 238, 0.1);
+      --gold: #fbbf24;
+      --gold-dim: rgba(251, 191, 36, 0.1);
+      --sage: #a3e635;
+      --sage-dim: rgba(163, 230, 53, 0.1);
+      --rose: #fb7185;
+      --rose-dim: rgba(251, 113, 133, 0.1);
+      --slate: #94a3b8;
+      --slate-dim: rgba(148, 163, 184, 0.1);
+      --copper: #fb923c;
+      --copper-dim: rgba(251, 146, 60, 0.1);
+
+      --active: #34d399;
+      --active-dim: rgba(52, 211, 153, 0.12);
+      --dormant: #fbbf24;
+      --dormant-dim: rgba(251, 191, 36, 0.12);
+      --unused: #6b7280;
+      --unused-dim: rgba(107, 114, 128, 0.12);
+    }
+  }
+
+  /* ============ RESET ============ */
+  * { margin: 0; padding: 0; box-sizing: border-box; }
+
+  body {
+    background: var(--bg);
+    background-image:
+      radial-gradient(ellipse at 15% 5%, var(--blue-dim) 0%, transparent 45%),
+      radial-gradient(ellipse at 85% 95%, var(--teal-dim) 0%, transparent 35%);
+    color: var(--text);
+    font-family: var(--font-body);
+    padding: 40px;
+    min-height: 100vh;
+  }
+
+  /* ============ RESPONSIVE NAV ============ */
+  .wrap {
+    max-width: 1400px;
+    margin: 0 auto;
+    display: grid;
+    grid-template-columns: 180px 1fr;
+    gap: 0 40px;
+  }
+  .main { min-width: 0; }
+
+  .toc {
+    position: sticky;
+    top: 24px;
+    align-self: start;
+    padding: 14px 0;
+    grid-row: 1 / -1;
+    max-height: calc(100dvh - 48px);
+    overflow-y: auto;
+  }
+  .toc::-webkit-scrollbar { width: 3px; }
+  .toc::-webkit-scrollbar-thumb { background: var(--surface-elevated); border-radius: 2px; }
+
+  .toc-title {
+    font-family: var(--font-mono);
+    font-size: 9px;
+    font-weight: 700;
+    text-transform: uppercase;
+    letter-spacing: 2px;
+    color: var(--text-dim);
+    padding: 0 0 10px;
+    margin-bottom: 8px;
+    border-bottom: 1px solid var(--border);
+  }
+
+  .toc a {
+    display: block;
+    font-size: 11px;
+    color: var(--text-dim);
+    text-decoration: none;
+    padding: 5px 8px;
+    border-radius: 5px;
+    border-left: 2px solid transparent;
+    transition: all 0.15s;
+    line-height: 1.4;
+    margin-bottom: 2px;
+  }
+  .toc a:hover { color: var(--text); background: var(--surface2); }
+  .toc a.active { color: var(--text); border-left-color: var(--blue); }
+
+  .toc .toc-group {
+    font-family: var(--font-mono);
+    font-size: 8px;
+    font-weight: 600;
+    text-transform: uppercase;
+    letter-spacing: 1.5px;
+    color: var(--text-muted);
+    padding: 10px 8px 4px;
+  }
+
+  @media (max-width: 1000px) {
+    .wrap { grid-template-columns: 1fr; padding-top: 0; }
+    body { padding: 20px; padding-top: 0; }
+
+    .toc {
+      position: sticky;
+      top: 0;
+      z-index: 200;
+      max-height: none;
+      display: flex;
+      gap: 4px;
+      align-items: center;
+      overflow-x: auto;
+      -webkit-overflow-scrolling: touch;
+      background: var(--bg);
+      border-bottom: 1px solid var(--border);
+      padding: 10px 0;
+      margin: 0 -20px;
+      padding-left: 20px;
+      padding-right: 20px;
+      grid-row: auto;
+    }
+    .toc::-webkit-scrollbar { display: none; }
+    .toc-title { display: none; }
+    .toc .toc-group { display: none; }
+
+    .toc a {
+      white-space: nowrap;
+      flex-shrink: 0;
+      border-left: none;
+      border-bottom: 2px solid transparent;
+      border-radius: 4px 4px 0 0;
+      padding: 6px 10px;
+      font-size: 10px;
+    }
+    .toc a.active {
+      border-left: none;
+      border-bottom-color: var(--blue);
+      background: var(--surface);
+    }
+    .main { padding-top: 20px; }
+    .sec-head { scroll-margin-top: 52px; }
+  }
+
+  /* ============ ANIMATION ============ */
+  @keyframes fadeUp {
+    from { opacity: 0; transform: translateY(14px); }
+    to { opacity: 1; transform: translateY(0); }
+  }
+  @keyframes fadeScale {
+    from { opacity: 0; transform: scale(0.95); }
+    to { opacity: 1; transform: scale(1); }
+  }
+
+  .anim {
+    animation: fadeUp 0.4s ease-out both;
+    animation-delay: calc(var(--i, 0) * 0.05s);
+  }
+  .anim-scale {
+    animation: fadeScale 0.35s ease-out both;
+    animation-delay: calc(var(--i, 0) * 0.05s);
+  }
+
+  @media (prefers-reduced-motion: reduce) {
+    *, *::before, *::after {
+      animation-duration: 0.01ms !important;
+      animation-delay: 0ms !important;
+      transition-duration: 0.01ms !important;
+    }
+  }
+
+  /* ============ TYPOGRAPHY ============ */
+  h1 {
+    font-size: 36px;
+    font-weight: 700;
+    letter-spacing: -1px;
+    margin-bottom: 4px;
+    text-wrap: balance;
+    line-height: 1.1;
+  }
+  h1 span { color: var(--blue); }
+
+  .subtitle {
+    color: var(--text-dim);
+    font-size: 13px;
+    font-family: var(--font-mono);
+    margin-bottom: 32px;
+    font-weight: 400;
+  }
+
+  .sec-head {
+    font-family: var(--font-mono);
+    font-size: 12px;
+    font-weight: 600;
+    text-transform: uppercase;
+    letter-spacing: 1.5px;
+    padding: 20px 0 16px;
+    display: flex;
+    align-items: center;
+    gap: 10px;
+  }
+  .sec-head .dot {
+    width: 8px;
+    height: 8px;
+    border-radius: 50%;
+    display: inline-block;
+  }
+
+  /* ============ OVERVIEW KPIs ============ */
+  .kpi-row {
+    display: grid;
+    grid-template-columns: repeat(3, 1fr);
+    gap: 14px;
+    margin-bottom: 28px;
+  }
+  .kpi {
+    background: var(--surface);
+    border: 1px solid var(--border);
+    border-radius: 10px;
+    padding: 16px 20px;
+    text-align: center;
+  }
+  .kpi-num {
+    font-size: 32px;
+    font-weight: 700;
+    font-family: var(--font-mono);
+    line-height: 1;
+    margin-bottom: 4px;
+  }
+  .kpi-label {
+    font-size: 11px;
+    color: var(--text-dim);
+    font-family: var(--font-mono);
+    text-transform: uppercase;
+    letter-spacing: 1px;
+  }
+
+  @media (max-width: 640px) {
+    .kpi-row { grid-template-columns: 1fr; }
+  }
+
+  /* ============ CARDS ============ */
+  .ve-card {
+    background: var(--surface);
+    border: 1px solid var(--border);
+    border-radius: 12px;
+    padding: 20px 24px;
+    margin-bottom: 16px;
+  }
+  .ve-card--hero {
+    background: var(--surface-elevated);
+    box-shadow: 0 4px 24px rgba(0, 0, 0, 0.06);
+    padding: 24px 28px;
+  }
+  .ve-card--recessed {
+    background: var(--surface-recessed);
+    box-shadow: inset 0 1px 3px rgba(0, 0, 0, 0.04);
+  }
+
+  .card-header {
+    display: flex;
+    align-items: center;
+    gap: 12px;
+    margin-bottom: 14px;
+  }
+  .card-icon {
+    width: 36px;
+    height: 36px;
+    border-radius: 8px;
+    display: flex;
+    align-items: center;
+    justify-content: center;
+    font-size: 14px;
+    font-family: var(--font-mono);
+    font-weight: 600;
+    flex-shrink: 0;
+  }
+  .card-title {
+    font-size: 15px;
+    font-weight: 600;
+    line-height: 1.3;
+  }
+  .card-title small {
+    display: block;
+    font-size: 11px;
+    font-weight: 400;
+    color: var(--text-dim);
+    font-family: var(--font-mono);
+    margin-top: 2px;
+  }
+
+  .card-body {
+    font-size: 13px;
+    line-height: 1.7;
+    color: var(--text-dim);
+  }
+  .card-body strong { color: var(--text); }
+
+  /* ============ STATUS BADGES ============ */
+  .badge {
+    display: inline-flex;
+    align-items: center;
+    gap: 5px;
+    font-size: 10px;
+    font-family: var(--font-mono);
+    font-weight: 500;
+    padding: 3px 8px;
+    border-radius: 4px;
+    text-transform: uppercase;
+    letter-spacing: 0.5px;
+  }
+  .badge::before {
+    content: '';
+    width: 6px;
+    height: 6px;
+    border-radius: 50%;
+  }
+  .badge--active { background: var(--active-dim); color: var(--active); }
+  .badge--active::before { background: var(--active); }
+  .badge--dormant { background: var(--dormant-dim); color: var(--dormant); }
+  .badge--dormant::before { background: var(--dormant); }
+  .badge--unused { background: var(--unused-dim); color: var(--unused); }
+  .badge--unused::before { background: var(--unused); }
+
+  /* ============ DISCOVERY BADGE ============ */
+  .disc-badge {
+    display: inline-block;
+    font-size: 9px;
+    font-family: var(--font-mono);
+    font-weight: 600;
+    padding: 2px 7px;
+    border-radius: 3px;
+    text-transform: uppercase;
+    letter-spacing: 0.5px;
+  }
+  .disc-badge--auto { background: var(--teal-dim); color: var(--teal); }
+  .disc-badge--manual { background: var(--gold-dim); color: var(--gold); }
+  .disc-badge--sideeffect { background: var(--sage-dim); color: var(--sage); }
+
+  /* ============ INNER GRID ============ */
+  .inner-grid {
+    display: grid;
+    grid-template-columns: 1fr 1fr;
+    gap: 10px;
+  }
+  .inner-grid--3 {
+    grid-template-columns: 1fr 1fr 1fr;
+  }
+
+  .inner-chip {
+    background: var(--surface2);
+    border: 1px solid var(--border);
+    border-radius: 6px;
+    padding: 8px 12px;
+    font-size: 12px;
+  }
+  .inner-chip .chip-title {
+    font-weight: 600;
+    font-size: 12px;
+    margin-bottom: 2px;
+  }
+  .inner-chip .chip-desc {
+    color: var(--text-dim);
+    font-size: 11px;
+    line-height: 1.4;
+  }
+  .inner-chip code {
+    font-family: var(--font-mono);
+    font-size: 10px;
+    background: var(--blue-dim);
+    color: var(--blue);
+    padding: 1px 5px;
+    border-radius: 3px;
+  }
+
+  @media (max-width: 640px) {
+    .inner-grid, .inner-grid--3 { grid-template-columns: 1fr; }
+  }
+
+  /* ============ TAG LIST ============ */
+  .tag-list {
+    display: flex;
+    flex-wrap: wrap;
+    gap: 6px;
+    margin-top: 10px;
+  }
+  .tag {
+    font-family: var(--font-mono);
+    font-size: 10px;
+    padding: 3px 8px;
+    border-radius: 4px;
+    background: var(--surface2);
+    border: 1px solid var(--border);
+    color: var(--text-dim);
+  }
+
+  /* ============ INTERFACE BLOCK ============ */
+  .iface {
+    background: var(--surface2);
+    border: 1px solid var(--border);
+    border-radius: 8px;
+    padding: 14px 18px;
+    margin-top: 12px;
+    font-family: var(--font-mono);
+    font-size: 11px;
+    line-height: 1.8;
+    white-space: pre-wrap;
+    overflow-x: auto;
+  }
+  .iface .kw { color: var(--blue); font-weight: 600; }
+  .iface .prop { color: var(--text); }
+  .iface .type { color: var(--teal); }
+  .iface .comment { color: var(--text-muted); font-style: italic; }
+
+  /* ============ HOW-TO STEPS ============ */
+  .steps {
+    counter-reset: step;
+    list-style: none;
+    padding: 0;
+    margin-top: 10px;
+  }
+  .steps li {
+    counter-increment: step;
+    padding: 6px 0 6px 32px;
+    position: relative;
+    font-size: 12px;
+    line-height: 1.6;
+    color: var(--text-dim);
+  }
+  .steps li::before {
+    content: counter(step);
+    position: absolute;
+    left: 0;
+    top: 6px;
+    width: 20px;
+    height: 20px;
+    background: var(--blue-dim);
+    color: var(--blue);
+    border-radius: 50%;
+    font-family: var(--font-mono);
+    font-size: 10px;
+    font-weight: 600;
+    display: flex;
+    align-items: center;
+    justify-content: center;
+  }
+  .steps li code {
+    font-family: var(--font-mono);
+    font-size: 10px;
+    background: var(--blue-dim);
+    color: var(--blue);
+    padding: 1px 5px;
+    border-radius: 3px;
+  }
+
+  /* ============ FLOW ARROW ============ */
+  .flow-arrow {
+    display: flex;
+    justify-content: center;
+    align-items: center;
+    gap: 8px;
+    color: var(--text-muted);
+    font-family: var(--font-mono);
+    font-size: 11px;
+    padding: 6px 0;
+  }
+  .flow-arrow svg {
+    width: 18px;
+    height: 18px;
+    fill: none;
+    stroke: var(--border-bright);
+    stroke-width: 2;
+    stroke-linecap: round;
+    stroke-linejoin: round;
+  }
+
+  /* ============ CONNECTOR PIPELINE ============ */
+  .ext-pipeline {
+    display: flex;
+    align-items: center;
+    gap: 0;
+    overflow-x: auto;
+    padding: 4px 0;
+  }
+  .ext-step {
+    background: var(--surface);
+    border: 1px solid var(--border);
+    border-radius: 8px;
+    padding: 8px 14px;
+    text-align: center;
+    min-width: 110px;
+    flex-shrink: 0;
+  }
+  .ext-step .step-label {
+    font-family: var(--font-mono);
+    font-size: 10px;
+    font-weight: 600;
+    text-transform: uppercase;
+    letter-spacing: 0.5px;
+    margin-bottom: 3px;
+  }
+  .ext-step .step-file {
+    font-size: 10px;
+    color: var(--text-dim);
+  }
+  .ext-arrow {
+    color: var(--text-muted);
+    font-size: 14px;
+    padding: 0 3px;
+    flex-shrink: 0;
+  }
+
+  /* ============ SECTION DIVIDER ============ */
+  .divider {
+    height: 1px;
+    background: linear-gradient(to right, transparent, var(--border-bright), transparent);
+    margin: 32px 0;
+  }
+
+  /* ============ LEGEND ============ */
+  .legend {
+    display: flex;
+    gap: 20px;
+    flex-wrap: wrap;
+    margin: 16px 0;
+  }
+  .legend-item {
+    display: flex;
+    align-items: center;
+    gap: 6px;
+    font-size: 11px;
+    color: var(--text-dim);
+    font-family: var(--font-mono);
+  }
+  .legend-swatch {
+    width: 12px;
+    height: 12px;
+    border-radius: 3px;
+  }
+
+  /* ============ CALLOUT ============ */
+  .callout {
+    background: var(--surface2);
+    border: 1px solid var(--border);
+    border-left: 3px solid var(--blue);
+    border-radius: 0 8px 8px 0;
+    padding: 14px 18px;
+    font-size: 12px;
+    line-height: 1.6;
+    color: var(--text-dim);
+    margin-top: 12px;
+  }
+  .callout strong { color: var(--text); font-weight: 600; }
+  .callout code {
+    font-family: var(--font-mono);
+    font-size: 10px;
+    background: var(--blue-dim);
+    color: var(--blue);
+    padding: 1px 5px;
+    border-radius: 3px;
+  }
+
+  /* ============ TWO-COL LAYOUT ============ */
+  .two-col {
+    display: grid;
+    grid-template-columns: 1fr 1fr;
+    gap: 16px;
+  }
+  @media (max-width: 768px) {
+    .two-col { grid-template-columns: 1fr; }
+  }
+</style>
+</head>
+<body>
+
+<div class="wrap">
+
+  <nav class="toc" id="toc">
+    <div class="toc-title">Extensibility</div>
+    <a href="#overview">Overview</a>
+    <div class="toc-group">Backend</div>
+    <a href="#computed-tags">Computed Tags</a>
+    <a href="#plugin-packs">Plugin Packs</a>
+    <a href="#trace-adapters">Trace Adapters</a>
+    <a href="#export-registry">Export Registry</a>
+    <div class="toc-group">Frontend</div>
+    <a href="#toolcall-ext">Tool-Call Extensions</a>
+    <a href="#explorer-ext">Explorer Extensions</a>
+    <div class="toc-group">Infrastructure</div>
+    <a href="#wiring">DI &amp; Wiring</a>
+    <a href="#howto">How to Extend</a>
+  </nav>
+
+  <div class="main">
+
+    <h1 class="anim" style="--i:0">Extensibility <span>Map</span></h1>
+    <p class="subtitle anim" style="--i:1">ground truth curator · 6 extension mechanisms · backend &amp; frontend</p>
+
+    <!-- ===== OVERVIEW ===== -->
+    <div id="overview" class="sec-head anim" style="--i:2; color: var(--blue)">
+      <span class="dot" style="background: var(--blue)"></span> Overview
+    </div>
+
+    <div class="kpi-row">
+      <div class="kpi anim-scale" style="--i:3">
+        <div class="kpi-num" style="color: var(--blue)">6</div>
+        <div class="kpi-label">Extension Mechanisms</div>
+      </div>
+      <div class="kpi anim-scale" style="--i:4">
+        <div class="kpi-num" style="color: var(--teal)">12</div>
+        <div class="kpi-label">Computed Tag Plugins</div>
+      </div>
+      <div class="kpi anim-scale" style="--i:5">
+        <div class="kpi-num" style="color: var(--gold)">4</div>
+        <div class="kpi-label">Backend + 2 Frontend</div>
+      </div>
+    </div>
+
+    <div class="ve-card ve-card--hero anim" style="--i:6">
+      <div class="card-body" style="margin-bottom: 14px">
+        <strong>Ground Truth Curator</strong> is designed around pluggable registries for domain-specific behavior, computed metadata, data import/export, and UI rendering. Each mechanism has its own discovery strategy and lifecycle.
+      </div>
+      <div class="legend">
+        <div class="legend-item"><div class="legend-swatch" style="background: var(--active); border: 1px solid var(--active)"></div> Active</div>
+        <div class="legend-item"><div class="legend-swatch" style="background: var(--dormant); border: 1px solid var(--dormant)"></div> Partial / Dormant</div>
+        <div class="legend-item"><div class="legend-swatch" style="background: var(--unused); border: 1px solid var(--unused)"></div> Defined, Not Consumed</div>
+      </div>
+      <div class="legend" style="margin-top: 8px">
+        <div class="legend-item"><span class="disc-badge disc-badge--auto">auto-discover</span></div>
+        <div class="legend-item"><span class="disc-badge disc-badge--manual">manual register</span></div>
+        <div class="legend-item"><span class="disc-badge disc-badge--sideeffect">module side-effect</span></div>
+      </div>
+    </div>
+
+    <div class="divider"></div>
+
+    <!-- ===== COMPUTED TAGS ===== -->
+    <div id="computed-tags" class="sec-head anim" style="--i:7; color: var(--teal)">
+      <span class="dot" style="background: var(--teal)"></span> 1 · Computed Tag Plugins
+    </div>
+
+    <div class="ve-card anim" style="--i:8; border-color: var(--teal-dim)">
+      <div class="card-header">
+        <div class="card-icon" style="background: var(--teal-dim); color: var(--teal)">CT</div>
+        <div class="card-title">
+          ComputedTagPlugin
+          <small>backend/app/plugins/computed_tags/ · Auto-discovered</small>
+        </div>
+        <span class="badge badge--active">Active</span>
+        <span class="disc-badge disc-badge--auto">auto-discover</span>
+      </div>
+      <div class="card-body">
+        <strong>The most mature plugin system.</strong> Computed tags are automatically generated metadata derived from item content. Plugins are auto-discovered by scanning the <code>computed_tags/</code> package: every <code>ComputedTagPlugin</code> subclass found in one of its modules is instantiated and registered at startup.
+      </div>
+
+      <div class="iface">
+<span class="kw">class</span> <span class="prop">ComputedTagPlugin</span>:
+    <span class="prop">tag_key</span>: <span class="type">str</span>          <span class="comment"># e.g. "answer:no_answer" or "dataset:_dynamic"</span>
+    <span class="kw">def</span> <span class="prop">compute</span>(doc) → <span class="type">str | None</span>  <span class="comment"># return tag value or None to skip</span></div>
+
+      <div style="margin-top: 14px; font-size: 12px; color: var(--text-dim); margin-bottom: 8px; font-weight: 600">12 Built-in Plugins</div>
+      <div class="inner-grid--3 inner-grid">
+        <div class="inner-chip">
+          <div class="chip-title">NoAnswerPlugin</div>
+          <div class="chip-desc">Tags items where answer is <code>no_answer</code></div>
+        </div>
+        <div class="inner-chip">
+          <div class="chip-title">QuestionLength × 3</div>
+          <div class="chip-desc">Short (≤10), Medium (11–30), Long (>30 words)</div>
+        </div>
+        <div class="inner-chip">
+          <div class="chip-title">RetrievalBehavior × 4</div>
+          <div class="chip-desc">No refs, Single, Two refs, Rich (3+)</div>
+        </div>
+        <div class="inner-chip">
+          <div class="chip-title">SingleTurn / MultiTurn</div>
+          <div class="chip-desc">Classifies by history entry count (≤2 vs >2)</div>
+        </div>
+        <div class="inner-chip">
+          <div class="chip-title">DatasetPlugin</div>
+          <div class="chip-desc">Dynamic tag: <code>dataset:{name}</code></div>
+        </div>
+        <div class="inner-chip">
+          <div class="chip-title">ReferenceType × 2</div>
+          <div class="chip-desc">Article (<code>CS\d+</code>) or Helpcenter (<code>/help</code>)</div>
+        </div>
+      </div>
+
+      <div class="tag-list">
+        <span class="tag">answer:no_answer</span>
+        <span class="tag">question_length:short</span>
+        <span class="tag">question_length:medium</span>
+        <span class="tag">question_length:long</span>
+        <span class="tag">retrieval_behavior:no_refs</span>
+        <span class="tag">retrieval_behavior:single</span>
+        <span class="tag">retrieval_behavior:two_refs</span>
+        <span class="tag">retrieval_behavior:rich</span>
+        <span class="tag">turns:singleturn</span>
+        <span class="tag">turns:multiturn</span>
+        <span class="tag">dataset:_dynamic</span>
+        <span class="tag">reference_type:article</span>
+        <span class="tag">reference_type:helpcenter</span>
+      </div>
+    </div>
+
+    <div class="ve-card ve-card--recessed anim" style="--i:9">
+      <div class="card-body">
+        <strong>Discovery pipeline:</strong> <code>pkgutil.iter_modules</code> scans the directory → <code>importlib.import_module</code> loads each → <code>inspect.getmembers</code> finds <code>ComputedTagPlugin</code> subclasses → instantiates with zero args → registers in <code>TagPluginRegistry</code>.
+      </div>
+      <div class="ext-pipeline" style="margin-top: 10px">
+        <div class="ext-step" style="border-color: var(--teal-dim)">
+          <div class="step-label" style="color: var(--teal)">Scan</div>
+          <div class="step-file">computed_tags/</div>
+        </div>
+        <div class="ext-arrow">→</div>
+        <div class="ext-step" style="border-color: var(--teal-dim)">
+          <div class="step-label" style="color: var(--teal)">Import</div>
+          <div class="step-file">importlib</div>
+        </div>
+        <div class="ext-arrow">→</div>
+        <div class="ext-step" style="border-color: var(--teal-dim)">
+          <div class="step-label" style="color: var(--teal)">Inspect</div>
+          <div class="step-file">find subclasses</div>
+        </div>
+        <div class="ext-arrow">→</div>
+        <div class="ext-step" style="border-color: var(--teal-dim)">
+          <div class="step-label" style="color: var(--teal)">Instantiate</div>
+          <div class="step-file">cls()</div>
+        </div>
+        <div class="ext-arrow">→</div>
+        <div class="ext-step" style="border-color: var(--teal-dim)">
+          <div class="step-label" style="color: var(--teal)">Register</div>
+          <div class="step-file">TagPluginRegistry</div>
+        </div>
+      </div>
+    </div>
+
+    <div class="divider"></div>
+
+    <!-- ===== PLUGIN PACKS ===== -->
+    <div id="plugin-packs" class="sec-head anim" style="--i:10; color: var(--gold)">
+      <span class="dot" style="background: var(--gold)"></span> 2 · Plugin Packs
+    </div>
+
+    <div class="ve-card anim" style="--i:11; border-color: var(--gold-dim)">
+      <div class="card-header">
+        <div class="card-icon" style="background: var(--gold-dim); color: var(--gold)">PP</div>
+        <div class="card-title">
+          PluginPack
+          <small>backend/app/plugins/packs/ · Manually registered</small>
+        </div>
+        <span class="badge badge--dormant">Partial</span>
+        <span class="disc-badge disc-badge--manual">manual register</span>
+      </div>
+      <div class="card-body">
+        <strong>Named bundles of domain behavior</strong> that layer on the generic host. Packs are the broadest extension point, offering 7 hookable surfaces. Currently only <code>RagCompatPack</code> is registered, and only the approval hooks and stats contribution are wired at runtime.
+      </div>
+
+      <div class="iface">
+<span class="kw">class</span> <span class="prop">PluginPack</span>:
+    <span class="prop">name</span>: <span class="type">str</span>
+    <span class="kw">def</span> <span class="prop">validate_registration</span>()         <span class="comment"># startup contract check</span>
+    <span class="kw">def</span> <span class="prop">collect_approval_errors</span>()       <span class="comment"># approval-time validation</span>
+    <span class="kw">def</span> <span class="prop">collect_approval_waivers</span>()      <span class="comment"># waive generic rules</span>
+    <span class="kw">def</span> <span class="prop">get_stats_contribution</span>()        <span class="comment"># pack-owned metrics</span>
+    <span class="kw">def</span> <span class="prop">get_explorer_fields</span>()           <span class="comment"># extra explorer columns</span>
+    <span class="kw">def</span> <span class="prop">get_import_transforms</span>()         <span class="comment"># shape import data</span>
+    <span class="kw">def</span> <span class="prop">get_export_transforms</span>()         <span class="comment"># shape export data</span></div>
+
+      <div style="margin-top: 14px">
+        <div class="two-col">
+          <div>
+            <div style="font-size: 12px; font-weight: 600; margin-bottom: 8px">Hook Surface Status</div>
+            <div style="display: flex; flex-direction: column; gap: 4px">
+              <div style="display: flex; align-items: center; gap: 8px; font-size: 11px">
+                <span class="badge badge--active" style="min-width: 70px">Active</span> <code style="font-family: var(--font-mono); font-size: 10px">validate_registration()</code>
+              </div>
+              <div style="display: flex; align-items: center; gap: 8px; font-size: 11px">
+                <span class="badge badge--active" style="min-width: 70px">Active</span> <code style="font-family: var(--font-mono); font-size: 10px">collect_approval_waivers()</code>
+              </div>
+              <div style="display: flex; align-items: center; gap: 8px; font-size: 11px">
+                <span class="badge badge--active" style="min-width: 70px">Active</span> <code style="font-family: var(--font-mono); font-size: 10px">get_stats_contribution()</code>
+              </div>
+              <div style="display: flex; align-items: center; gap: 8px; font-size: 11px">
+                <span class="badge badge--unused" style="min-width: 70px">Unused</span> <code style="font-family: var(--font-mono); font-size: 10px">get_explorer_fields()</code>
+              </div>
+              <div style="display: flex; align-items: center; gap: 8px; font-size: 11px">
+                <span class="badge badge--unused" style="min-width: 70px">Unused</span> <code style="font-family: var(--font-mono); font-size: 10px">get_import_transforms()</code>
+              </div>
+              <div style="display: flex; align-items: center; gap: 8px; font-size: 11px">
+                <span class="badge badge--unused" style="min-width: 70px">Unused</span> <code style="font-family: var(--font-mono); font-size: 10px">get_export_transforms()</code>
+              </div>
+            </div>
+          </div>
+          <div>
+            <div style="font-size: 12px; font-weight: 600; margin-bottom: 8px">Current Implementation</div>
+            <div class="inner-chip">
+              <div class="chip-title">RagCompatPack</div>
+              <div class="chip-desc">Owns RAG/retrieval domain: reference management, per-tool-call retrieval state, approval waivers for retrieval-only items, and refs→per-call migration.</div>
+            </div>
+            <div class="callout" style="margin-top: 8px; border-left-color: var(--gold)">
+              <strong>Not auto-discovered.</strong> To add a pack, edit <code>pack_registry.py</code> and register the new <code>PluginPack</code> instance manually.
+            </div>
+          </div>
+        </div>
+      </div>
+    </div>
+
+    <div class="divider"></div>
+
+    <!-- ===== TRACE ADAPTERS ===== -->
+    <div id="trace-adapters" class="sec-head anim" style="--i:12; color: var(--copper)">
+      <span class="dot" style="background: var(--copper)"></span> 3 · Trace Adapter Plugins
+    </div>
+
+    <div class="ve-card anim" style="--i:13; border-color: var(--copper-dim)">
+      <div class="card-header">
+        <div class="card-icon" style="background: var(--copper-dim); color: var(--copper)">TA</div>
+        <div class="card-title">
+          TraceAdapterPlugin
+          <small>backend/app/plugins/adapters/ · Auto-discovered</small>
+        </div>
+        <span class="badge badge--dormant">Dormant</span>
+        <span class="disc-badge disc-badge--auto">auto-discover</span>
+      </div>
+      <div class="card-body">
+        <strong>Import/conversion plugins</strong> for foreign trace payloads. Adapters convert raw external data into <code>AgenticGroundTruthEntry</code> objects. They are auto-discovered like computed tags, but the registry stores <strong>classes</strong> (not instances) because adapters need constructor arguments at runtime.
+      </div>
+
+      <div class="iface">
+<span class="kw">class</span> <span class="prop">TraceAdapterPlugin</span>:
+    <span class="prop">name</span>: <span class="type">str</span>
+    <span class="kw">def</span> <span class="prop">adapt_payload</span>(payload, **kwargs) → <span class="type">list[AgenticGroundTruthEntry]</span></div>
+
+      <div class="two-col" style="margin-top: 14px">
+        <div class="inner-chip">
+          <div class="chip-title">TraceExportAdapter</div>
+          <div class="chip-desc">Maps trace-export payloads into GT entries: history, tool calls, context, feedback, trace IDs, metadata. Tags: <code>source:trace-export</code>, <code>workflow:agentic-rca</code>.</div>
+        </div>
+        <div class="callout" style="border-left-color: var(--copper); margin-top: 0">
+          <strong>Registry exists but isn't consumed.</strong> Current code instantiates <code>TraceExportAdapter</code> directly in <code>demo_seed.py</code> rather than going through the adapter registry. The auto-discovery infrastructure is ready for use but dormant.
+        </div>
+      </div>
+    </div>
+
+    <div class="divider"></div>
+
+    <!-- ===== EXPORT REGISTRY ===== -->
+    <div id="export-registry" class="sec-head anim" style="--i:14; color: var(--sage)">
+      <span class="dot" style="background: var(--sage)"></span> 4 · Export Registry
+    </div>
+
+    <div class="ve-card anim" style="--i:15; border-color: var(--sage-dim)">
+      <div class="card-header">
+        <div class="card-icon" style="background: var(--sage-dim); color: var(--sage)">EX</div>
+        <div class="card-title">
+          ExportProcessor &amp; ExportFormatter
+          <small>backend/app/exports/registry.py · Manually registered</small>
+        </div>
+        <span class="badge badge--active">Active</span>
+        <span class="disc-badge disc-badge--manual">manual register</span>
+      </div>
+      <div class="card-body">
+        Two complementary registries for the export pipeline. <strong>Processors</strong> transform normalized doc dicts before formatting. <strong>Formatters</strong> serialize docs into bytes/string payloads. Wired explicitly in <code>container.py</code> — no auto-discovery.
+      </div>
+
+      <div class="two-col" style="margin-top: 14px">
+        <div>
+          <div class="iface">
+<span class="kw">class</span> <span class="prop">ExportProcessor</span>(<span class="type">ABC</span>):
+    <span class="prop">name</span>: <span class="type">str</span>
+    <span class="kw">def</span> <span class="prop">process</span>(docs) → <span class="type">list[dict]</span></div>
+          <div class="inner-chip" style="margin-top: 8px">
+            <div class="chip-title">MergeTagsProcessor</div>
+            <div class="chip-desc">Merges manual + computed tags into a unified <code>tags</code> field</div>
+          </div>
+        </div>
+        <div>
+          <div class="iface">
+<span class="kw">class</span> <span class="prop">ExportFormatter</span>(<span class="type">ABC</span>):
+    <span class="prop">name</span>: <span class="type">str</span>
+    <span class="kw">def</span> <span class="prop">format</span>(docs) → <span class="type">bytes | str</span></div>
+          <div class="inner-chip" style="margin-top: 8px">
+            <div class="chip-title">JsonItemsFormatter</div>
+            <div class="chip-desc">Raw JSON array output</div>
+          </div>
+          <div class="inner-chip" style="margin-top: 4px">
+            <div class="chip-title">JsonSnapshotPayloadFormatter</div>
+            <div class="chip-desc">Envelope with <code>schemaVersion</code>, <code>snapshotAt</code>, <code>datasetNames</code></div>
+          </div>
+        </div>
+      </div>
+
+      <div class="callout" style="border-left-color: var(--sage)">
+        <strong>Processor chain is configurable</strong> via <code>EXPORT_PROCESSOR_ORDER</code> in config. <code>resolve_chain(requested, default_order)</code> determines the pipeline. Formatters support factory registration for parameterized construction at runtime.
+      </div>
+    </div>
+
+    <div class="divider"></div>
+
+    <!-- ===== TOOL-CALL EXTENSIONS ===== -->
+    <div id="toolcall-ext" class="sec-head anim" style="--i:16; color: var(--rose)">
+      <span class="dot" style="background: var(--rose)"></span> 5 · Tool-Call UI Extensions
+    </div>
+
+    <div class="ve-card anim" style="--i:17; border-color: var(--rose-dim)">
+      <div class="card-header">
+        <div class="card-icon" style="background: var(--rose-dim); color: var(--rose)">TC</div>
+        <div class="card-title">
+          ToolCallExtensionRegistration
+          <small>frontend/src/registry/ · Module side-effect</small>
+        </div>
+        <span class="badge badge--active">Active</span>
+        <span class="disc-badge disc-badge--sideeffect">module side-effect</span>
+      </div>
+      <div class="card-body">
+        Registry of React components rendered inline within tool-call detail views. Extensions register by <strong>discriminator</strong> (e.g. <code>toolCall:retrieval</code>) and optional <code>matches()</code> predicate. Multiple extensions can match a single tool call. Each renders inside a <code>PluginErrorBoundary</code> for fault isolation.
+      </div>
+
+      <div class="iface">
+<span class="kw">interface</span> <span class="prop">ToolCallExtensionRegistration</span> {
+    <span class="prop">discriminator</span>: <span class="type">string</span>;          <span class="comment">// e.g. "toolCall" or "toolCall:search"</span>
+    <span class="prop">component</span>: <span class="type">React.ComponentType</span>;
+    <span class="prop">displayName</span>: <span class="type">string</span>;
+    <span class="prop">matches</span>?: <span class="type">(tc) => boolean</span>;     <span class="comment">// optional predicate filter</span>
+}</div>
+
+      <div class="two-col" style="margin-top: 14px">
+        <div class="inner-chip">
+          <div class="chip-title">RAG References Extension</div>
+          <div class="chip-desc">Adds inline reference-management UI to retrieval-like tool calls. Matches: <code>search</code>, <code>retrieval</code>, <code>lookup</code>, <code>fetch</code>, <code>query</code>, <code>find</code>, <code>get_documents</code>, <code>vector_search</code>, and compound names.</div>
+        </div>
+        <div>
+          <div style="font-size: 12px; font-weight: 600; margin-bottom: 8px">Resolution Flow</div>
+          <div class="ext-pipeline">
+            <div class="ext-step" style="border-color: var(--rose-dim)">
+              <div class="step-label" style="color: var(--rose)">Resolve</div>
+              <div class="step-file">by discriminator</div>
+            </div>
+            <div class="ext-arrow">→</div>
+            <div class="ext-step" style="border-color: var(--rose-dim)">
+              <div class="step-label" style="color: var(--rose)">Filter</div>
+              <div class="step-file">matches()?</div>
+            </div>
+            <div class="ext-arrow">→</div>
+            <div class="ext-step" style="border-color: var(--rose-dim)">
+              <div class="step-label" style="color: var(--rose)">Render</div>
+              <div class="step-file">ErrorBoundary</div>
+            </div>
+          </div>
+        </div>
+      </div>
+    </div>
+
+    <div class="divider"></div>
+
+    <!-- ===== EXPLORER EXTENSIONS ===== -->
+    <div id="explorer-ext" class="sec-head anim" style="--i:18; color: var(--slate)">
+      <span class="dot" style="background: var(--slate)"></span> 6 · Explorer Extensions
+    </div>
+
+    <div class="ve-card anim" style="--i:19; border-color: var(--slate-dim)">
+      <div class="card-header">
+        <div class="card-icon" style="background: var(--slate-dim); color: var(--slate)">EE</div>
+        <div class="card-title">
+          ExplorerExtension
+          <small>frontend/src/registry/ExplorerExtensions.ts · Module side-effect</small>
+        </div>
+        <span class="badge badge--dormant">Partial</span>
+        <span class="disc-badge disc-badge--sideeffect">module side-effect</span>
+      </div>
+      <div class="card-body">
+        Plugin packs can contribute extra <strong>table columns</strong> and <strong>filter dimensions</strong> to the Questions Explorer view. Extensions register by <code>packName</code>. Re-registering the same pack replaces its bundle.
+      </div>
+
+      <div class="iface">
+<span class="kw">interface</span> <span class="prop">ExplorerExtension</span> {
+    <span class="prop">packName</span>: <span class="type">string</span>;
+    <span class="prop">columns</span>: <span class="type">ExplorerColumnExtension[]</span>;
+    <span class="prop">filters</span>: <span class="type">ExplorerFilterExtension[]</span>;
+}</div>
+
+      <div class="two-col" style="margin-top: 14px">
+        <div>
+          <div style="font-size: 12px; font-weight: 600; margin-bottom: 8px">Built-in: rag-compat</div>
+          <div class="inner-chip">
+            <div class="chip-title">Column: referenceCount</div>
+            <div class="chip-desc">Renders a "Refs" column showing <code>getItemReferences(item).length</code></div>
+          </div>
+          <div class="inner-chip" style="margin-top: 4px">
+            <div class="chip-title">Filter: hasReferences</div>
+            <div class="chip-desc">yes / no filter for items with references</div>
+          </div>
+        </div>
+        <div>
+          <div style="font-size: 12px; font-weight: 600; margin-bottom: 8px">Surface Status</div>
+          <div style="display: flex; flex-direction: column; gap: 4px">
+            <div style="display: flex; align-items: center; gap: 8px; font-size: 11px">
+              <span class="badge badge--active" style="min-width: 70px">Active</span> Column extensions → rendered in <code style="font-family: var(--font-mono); font-size: 10px">QuestionsExplorer</code>
+            </div>
+            <div style="display: flex; align-items: center; gap: 8px; font-size: 11px">
+              <span class="badge badge--unused" style="min-width: 70px">Unused</span> Filter extensions → defined but no UI consumer found
+            </div>
+          </div>
+        </div>
+      </div>
+    </div>
+
+    <div class="divider"></div>
+
+    <!-- ===== DI WIRING ===== -->
+    <div id="wiring" class="sec-head anim" style="--i:20; color: var(--blue)">
+      <span class="dot" style="background: var(--blue)"></span> DI &amp; Bootstrap Wiring
+    </div>
+
+    <div class="ve-card anim" style="--i:21">
+      <div class="card-body" style="margin-bottom: 14px">
+        Each extension mechanism has a different relationship to the dependency injection container and startup lifecycle:
+      </div>
+
+      <div class="inner-grid" style="gap: 10px">
+        <div class="inner-chip" style="border-left: 3px solid var(--teal)">
+          <div class="chip-title">Computed Tags</div>
+          <div class="chip-desc"><strong>Module singleton.</strong> Lazy init via <code>get_default_registry()</code>. Called ad-hoc from services and routes. Not stored on the DI container.</div>
+        </div>
+        <div class="inner-chip" style="border-left: 3px solid var(--gold)">
+          <div class="chip-title">Plugin Packs</div>
+          <div class="chip-desc"><strong>Full DI wiring.</strong> Stored as <code>plugin_pack_registry</code> on the container. Validated during <code>startup_cosmos()</code>.</div>
+        </div>
+        <div class="inner-chip" style="border-left: 3px solid var(--copper)">
+          <div class="chip-title">Trace Adapters</div>
+          <div class="chip-desc"><strong>Module singleton (unused).</strong> Auto-discovery exists via <code>get_default_adapter_registry()</code> but is not wired into the container or called at runtime.</div>
+        </div>
+        <div class="inner-chip" style="border-left: 3px solid var(--sage)">
+          <div class="chip-title">Export Registries</div>
+          <div class="chip-desc"><strong>Explicit DI.</strong> Built in <code>container.py</code> via factory methods. Passed to <code>SnapshotService</code>.</div>
+        </div>
+        <div class="inner-chip" style="border-left: 3px solid var(--rose)">
+          <div class="chip-title">Tool-Call Extensions</div>
+          <div class="chip-desc"><strong>Module side-effect.</strong> Self-registers on import of <code>registry/index.ts</code>. No explicit boot code.</div>
+        </div>
+        <div class="inner-chip" style="border-left: 3px solid var(--slate)">
+          <div class="chip-title">Explorer Extensions</div>
+          <div class="chip-desc"><strong>Module side-effect.</strong> Self-registers when <code>ExplorerExtensions.ts</code> exports are imported.</div>
+        </div>
+      </div>
+    </div>
+
+    <div class="divider"></div>
+
+    <!-- ===== HOW TO EXTEND ===== -->
+    <div id="howto" class="sec-head anim" style="--i:22; color: var(--blue)">
+      <span class="dot" style="background: var(--blue)"></span> How to Add Extensions
+    </div>
+
+    <div class="two-col">
+      <div class="ve-card anim" style="--i:23; border-color: var(--teal-dim)">
+        <div class="card-header">
+          <div class="card-icon" style="background: var(--teal-dim); color: var(--teal)">+</div>
+          <div class="card-title">New Computed Tag</div>
+        </div>
+        <ol class="steps">
+          <li>Create a new <code>.py</code> file in <code>plugins/computed_tags/</code></li>
+          <li>Subclass <code>ComputedTagPlugin</code></li>
+          <li>Define <code>tag_key</code> property and <code>compute(doc)</code> method</li>
+          <li>Done — auto-discovered at next startup. Zero-arg constructor required.</li>
+        </ol>
+      </div>
+
+      <div class="ve-card anim" style="--i:24; border-color: var(--gold-dim)">
+        <div class="card-header">
+          <div class="card-icon" style="background: var(--gold-dim); color: var(--gold)">+</div>
+          <div class="card-title">New Plugin Pack</div>
+        </div>
+        <ol class="steps">
+          <li>Create a new <code>.py</code> in <code>plugins/packs/</code></li>
+          <li>Subclass <code>PluginPack</code>, implement hooks</li>
+          <li>Edit <code>pack_registry.py</code> to import and register</li>
+          <li>Wire any frontend counterpart in <code>ExplorerExtensions.ts</code></li>
+        </ol>
+      </div>
+
+      <div class="ve-card anim" style="--i:25; border-color: var(--copper-dim)">
+        <div class="card-header">
+          <div class="card-icon" style="background: var(--copper-dim); color: var(--copper)">+</div>
+          <div class="card-title">New Trace Adapter</div>
+        </div>
+        <ol class="steps">
+          <li>Create a new <code>.py</code> in <code>plugins/adapters/</code></li>
+          <li>Subclass <code>TraceAdapterPlugin</code></li>
+          <li>Implement <code>name</code> and <code>adapt_payload()</code></li>
+          <li>Auto-discovered — but wire a consumer to use the registry.</li>
+        </ol>
+      </div>
+
+      <div class="ve-card anim" style="--i:26; border-color: var(--sage-dim)">
+        <div class="card-header">
+          <div class="card-icon" style="background: var(--sage-dim); color: var(--sage)">+</div>
+          <div class="card-title">New Export Format</div>
+        </div>
+        <ol class="steps">
+          <li>Create processor in <code>exports/processors/</code> or formatter in <code>exports/formatters/</code></li>
+          <li>Subclass <code>ExportProcessor</code> or <code>ExportFormatter</code></li>
+          <li>Register in <code>container.py</code> build methods</li>
+          <li>Update <code>EXPORT_PROCESSOR_ORDER</code> config if needed</li>
+        </ol>
+      </div>
+
+      <div class="ve-card anim" style="--i:27; border-color: var(--rose-dim)">
+        <div class="card-header">
+          <div class="card-icon" style="background: var(--rose-dim); color: var(--rose)">+</div>
+          <div class="card-title">New Tool-Call Extension</div>
+        </div>
+        <ol class="steps">
+          <li>Create a React component accepting <code>ToolCallActionProps</code></li>
+          <li>Create a registration file importing <code>toolCallExtensions</code></li>
+          <li>Call <code>toolCallExtensions.register({...})</code></li>
+          <li>Import the file from <code>registry/index.ts</code> for side-effect init</li>
+        </ol>
+      </div>
+
+      <div class="ve-card anim" style="--i:28; border-color: var(--slate-dim)">
+        <div class="card-header">
+          <div class="card-icon" style="background: var(--slate-dim); color: var(--slate)">+</div>
+          <div class="card-title">New Explorer Extension</div>
+        </div>
+        <ol class="steps">
+          <li>Define column/filter specs in <code>ExplorerExtensions.ts</code></li>
+          <li>Call <code>registerExplorerExtension({...})</code></li>
+          <li>Column extensions render automatically in <code>QuestionsExplorer</code></li>
+          <li>Filter extensions need a UI consumer (not yet wired)</li>
+        </ol>
+      </div>
+    </div>
+
+  </div><!-- /main -->
+
+</div><!-- /wrap -->
+
+<script>
+(function() {
+  const toc = document.getElementById('toc');
+  const links = toc.querySelectorAll('a');
+  const sections = [];
+
+  links.forEach(link => {
+    const id = link.getAttribute('href').slice(1);
+    const el = document.getElementById(id);
+    if (el) sections.push({ id, el, link });
+  });
+
+  const observer = new IntersectionObserver(entries => {
+    entries.forEach(entry => {
+      if (entry.isIntersecting) {
+        links.forEach(l => l.classList.remove('active'));
+        const match = sections.find(s => s.el === entry.target);
+        if (match) {
+          match.link.classList.add('active');
+          if (window.innerWidth <= 1000) {
+            match.link.scrollIntoView({
+              behavior: 'smooth', block: 'nearest', inline: 'center'
+            });
+          }
+        }
+      }
+    });
+  }, { rootMargin: '-10% 0px -80% 0px' });
+
+  sections.forEach(s => observer.observe(s.el));
+
+  links.forEach(link => {
+    link.addEventListener('click', e => {
+      e.preventDefault();
+      const id = link.getAttribute('href').slice(1);
+      const el = document.getElementById(id);
+      if (el) {
+        el.scrollIntoView({ behavior: 'smooth', block: 'start' });
+        history.replaceState(null, '', '#' + id);
+      }
+    });
+  });
+})();
+</script>
+
+</body>
+</html>
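Reviewer note on the computed-tag discovery pipeline described in the page above (`pkgutil.iter_modules` → `importlib.import_module` → `inspect.getmembers` → zero-arg instantiation → registry). The following is a minimal sketch of that flow, not the project's actual code: the `ComputedTagPlugin` and `TagPluginRegistry` classes here are simplified stand-ins, `discover_plugins` and `compute_all` are hypothetical names, and the rules for skipping abstract or merely imported classes are assumptions.

```python
import importlib
import inspect
import pkgutil
from abc import ABC, abstractmethod
from types import ModuleType
from typing import Dict, List, Optional


class ComputedTagPlugin(ABC):
    """Hypothetical stand-in for the project's base class."""

    @property
    @abstractmethod
    def tag_key(self) -> str:
        """Key this plugin owns, e.g. "turns"."""

    @abstractmethod
    def compute(self, doc: dict) -> Optional[str]:
        """Return a tag such as "turns:multiturn", or None to emit nothing."""


class TagPluginRegistry:
    """Hypothetical registry: one plugin instance per tag_key."""

    def __init__(self) -> None:
        self._plugins: Dict[str, ComputedTagPlugin] = {}

    def register(self, plugin: ComputedTagPlugin) -> None:
        if plugin.tag_key in self._plugins:
            raise ValueError(f"duplicate tag_key: {plugin.tag_key}")
        self._plugins[plugin.tag_key] = plugin

    def compute_all(self, doc: dict) -> List[str]:
        tags = (p.compute(doc) for p in self._plugins.values())
        return sorted(t for t in tags if t is not None)


def discover_plugins(package: ModuleType, registry: TagPluginRegistry) -> None:
    # Scan: list the modules that live directly inside the package directory.
    for info in pkgutil.iter_modules(package.__path__):
        # Import: load each module by its fully qualified name.
        module = importlib.import_module(f"{package.__name__}.{info.name}")
        # Inspect: keep concrete subclasses defined in this module.
        for _, cls in inspect.getmembers(module, inspect.isclass):
            if not issubclass(cls, ComputedTagPlugin) or inspect.isabstract(cls):
                continue
            if cls.__module__ != module.__name__:
                continue  # assumption: ignore classes only imported here
            # Instantiate with zero args, then register.
            registry.register(cls())
```

Because the sketch calls `cls()` with no arguments, a plugin whose constructor requires parameters would raise `TypeError` during discovery, which is consistent with the page's "zero-arg constructor required" note.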
diff --git a/docs/getting-started/quickstart.md b/docs/getting-started/quickstart.md
index 3015aa2..baf9e4c 100644
--- a/docs/getting-started/quickstart.md
+++ b/docs/getting-started/quickstart.md
@@ -9,36 +9,38 @@ This guide will help you create your first ground truth item.
 
 ## Start the Backend
 
-1. Navigate to the backend directory:
+1. From the repository root, start the backend:
    ```bash
-   cd backend
+   make -f Makefile.harness backend
    ```
 
-2. Start the development server:
-   ```bash
-   uv run uvicorn app.main:app --reload
-   ```
-
-   The API will be available at `http://localhost:8000`.
+    The API will be available at `http://localhost:8000`.
 
-3. Verify the backend is running:
+2. Verify the backend is running:
    ```bash
    curl http://localhost:8000/healthz
    ```
 
 ## Start the Frontend
 
-1. In a new terminal, navigate to the frontend directory:
+1. In a new terminal from the repository root, start the frontend:
    ```bash
-   cd frontend
+   make -f Makefile.harness frontend
    ```
 
-2. Start the development server:
-   ```bash
-   npm run dev
-   ```
+    The app will be available at `http://localhost:5173` by default.
+
+    To run both services from one terminal instead, use `make -f Makefile.harness dev`.
+
+    For agent-friendly background startup, use `make -f Makefile.harness dev-up` and later `make -f Makefile.harness dev-down`. Background logs and PID files are written to `.harness/dev/`.
+
+    To launch the demo experience with seeded data and a fixed local user, run:
+
+    ```bash
+    VITE_DEMO_MODE=true VITE_DEV_USER_ID=demo-user make -f Makefile.harness dev-up
+    ```
 
-   The app will be available at `http://localhost:5173`.
+    This starts both services in the background, enables demo mode in the frontend, and sets the backend demo user identity to `demo-user`.
 
 ## Create Your First Ground Truth Item
 
diff --git a/docs/prds/agentic-curation-redesign.md b/docs/prds/agentic-curation-redesign.md
new file mode 100644
index 0000000..5e33928
--- /dev/null
+++ b/docs/prds/agentic-curation-redesign.md
@@ -0,0 +1,356 @@
+<!-- markdownlint-disable-file -->
+<!-- markdown-table-prettify-ignore-start -->
+# Agentic Curation Redesign - Product Requirements Document (PRD)
+Version 0.6 | Status Draft | Owner TBD | Team Ground Truth Curator | Target TBD | Lifecycle Discovery
+
+## Progress Tracker
+| Phase | Done | Gaps | Updated |
+|-------|------|------|---------|
+| Context | ✅ | Confirm final plugin packaging decisions | 2026-03-11 |
+| Problem & Users | ✅ | Validate RAG compatibility stakeholders | 2026-03-11 |
+| Scope | ✅ | Confirm any legacy API-compatibility promises | 2026-03-11 |
+| Requirements | ✅ | Refine implementation sequencing with engineering | 2026-03-11 |
+| Metrics & Risks | ✅ | Validate performance budget for large traces | 2026-03-11 |
+| Operationalization | ✅ | Confirm rollout gates and temporary migration flags | 2026-03-11 |
+| Finalization | ⏳ | Approve supersession of prior PRD and translate resolved decisions into an implementation plan | 2026-03-11 |
+Unresolved Critical Questions: 0 | TBDs: 3
+
+## 1. Executive Summary
+### Context
+The current Ground Truth Curator implementation is centered on a RAG-specific curation flow. A prior PRD proposed adding agentic curation as a separate mode in the same codebase. That direction would preserve the RAG-shaped core and layer agentic behavior beside it, increasing duplication across models, UI trees, validators, and long-term maintenance paths.
+
+This PRD replaces that direction. The new initiative redesigns the current curation experience around an agentic-first, generic core derived from `wireframes/agent-curation-wireframe-v2.2.html` and `wireframes/gt_schema_v5_generic.py`. The redesign intentionally reuses durable plumbing such as API routing, services, storage, assignment, snapshot/export, auth, and computed-tag infrastructure, while allowing large portions of the current RAG-specific curation implementation to be deleted and rebuilt on top of the new core.
+
+### Core Opportunity
+Turn Ground Truth Curator into a single generic curation platform for agentic traces, tool decisions, context, feedback, and extensible domain-specific panels. Instead of preserving RAG as the host architecture, RAG becomes one plugin pack implemented through the new extension points. This exercises the platform's generic design immediately and prevents a second long-lived fork of the product.
+
+### Goals
+| Goal ID | Statement | Type | Baseline | Target | Timeframe | Priority |
+|---------|-----------|------|----------|--------|-----------|----------|
+| G-001 | Deliver the primary curation workflow defined in `agent-curation-wireframe-v2.2.html` as the product's default architecture | Business | Current UI is RAG-oriented | Wireframe-aligned agentic-first workflow implemented | TBD | P0 |
+| G-002 | Replace the RAG-shaped core data model and editor architecture with a generic schema-first core based on `gt_schema_v5_generic.py` | Technical | Core contracts are RAG-specific | Generic core owns CRUD, editing, review, and export | TBD | P0 |
+| G-003 | Recreate the existing RAG curation flow on the redesigned platform through plugins and documented extension points, not through a second app mode | Technical | RAG is the hard-coded host flow | RAG operates as a compatibility pack on the generic core | TBD | P0 |
+| G-004 | Delete redundant legacy curation code once the redesigned core and compatibility pack cover required behavior | Technical | Parallel RAG-oriented implementations exist | Legacy editor/mode-specific branches retired | TBD | P1 |
+| G-005 | Preserve and reuse proven plumbing where it remains valuable (routing, service orchestration, storage, assignment, exports, auth, telemetry, computed tags) | Operational | Plumbing is intertwined with current feature shape | Plumbing survives behind cleaner generic contracts | TBD | P1 |
+| G-006 | Improve long-term maintainability by ensuring future domain workflows are added via plugins rather than new top-level modes | Operational | New workflows imply app forking risk | New workflows fit extension model | TBD | P1 |
+
+### Objectives (Optional)
+| Objective | Key Result | Priority | Owner |
+|-----------|------------|----------|-------|
+| Establish agentic-first core | Core editor, explorer, and approval flow operate on the generic schema | P0 | TBD |
+| Prove generic architecture with RAG | Existing RAG flow works through plugin surfaces without reviving the old host architecture | P0 | TBD |
+| Simplify codebase | Legacy mode split and redundant editor paths are removed after parity | P1 | TBD |
+
+## 2. Problem Definition
+### Current Situation
+Ground Truth Curator currently assumes a RAG-shaped item: question/answer content, grounding references, expected behavior annotations, and reference-driven approval gating. The UX, models, validators, and explorer semantics all reflect that assumption.
+
+The desired future state is materially different:
+* Tool calls are a first-class review surface.
+* Agent traces, trace payloads, metadata, feedback, and plugin data are generic and variably shaped.
+* Conversations may involve flexible role names and multi-step agent behavior.
+* The dominant workflow is the agentic curation experience represented in the wireframe, not the current RAG form.
+
+### Problem Statement
+If agentic curation is added as a separate mode on top of the current RAG-centric architecture, the product will carry two competing cores: the old RAG-first one and the new agentic one. That creates duplicated UI stacks, duplicated validation rules, duplicated domain models, and permanent architectural drag. Because there are no current users of the existing setup, the better choice is to redesign now around the target workflow and treat RAG as a specialization of that new platform.
+
+### Root Causes
+* The current core domain contracts are organized around RAG concepts rather than generic agentic review primitives.
+* The earlier "separate mode" direction optimizes for coexistence instead of simplification.
+* Flexible fields such as feedback, metadata, trace payloads, tool responses, and plugin payloads need generic rendering and extension contracts, not hard-coded schemas in the host app.
+* Current approval and editing assumptions are tightly coupled to references and expected-behavior semantics that should become plugin-provided behavior for RAG, not universal product rules.
+
+### Impact of Inaction
+Keeping the earlier "separate mode" direction would:
+* Increase engineering cost for every future workflow by requiring more mode-specific branches.
+* Delay delivery of the wireframe-defined agentic workflow because the platform remains anchored to RAG semantics.
+* Limit the ability to delete old code and simplify the repository while no users depend on it.
+* Undermine the claim that the new schema and editor are genuinely generic because RAG would still be the real host architecture.
+
+## 3. Users & Personas
+| Persona | Goals | Pain Points | Impact |
+|---------|-------|------------|--------|
+| **Agentic Curator** - Reviews traces, tool usage, and responses | Curate agent conversations quickly and confidently; decide which tools were required; edit responses and metadata | Current product is optimized for references, not tool-centric review | Primary day-to-day user |
+| **Curation Lead** - Monitors throughput and data quality | Assign work, review progress, track status and quality signals across datasets | Current metrics and explorer semantics do not map cleanly to agentic work | Operational owner |
+| **ML Engineer / Evaluator** - Consumes approved data | Export generic agentic datasets with clear schema and plugin data | No stable generic contract for agentic evaluation datasets | Downstream consumer |
+| **Plugin Author** - Extends the product for domain workflows such as RAG | Add domain-specific panels, rules, metrics, and transforms without forking the app | Current architecture requires hard-coded product changes for each domain | Internal platform extender |
+
+### Journeys (Optional)
+1. Agentic Curator self-assigns an item from the queue, reviews conversation/tool calls/evidence, edits the item, and approves it.
+2. Curation Lead filters the explorer by tags, plugin-provided signals, and approval state to manage quality and throughput.
+3. Plugin Author adds a compatibility pack for RAG that restores reference-centric review behavior using core extension points rather than a new product mode.
+
+## 4. Scope
+### In Scope
+* Redesign the curation product around a single generic agentic-first core.
+* Adopt `AgenticGroundTruthEntry` and its supporting models from `gt_schema_v5_generic.py` as the foundation for core data contracts.
+* Build the primary UX from `agent-curation-wireframe-v2.2.html`, including queue/sidebar, split-pane editing, evidence drawer, tool-call review, trace/metadata/feedback surfaces, and lifecycle actions.
+* Define extension points for renderer contributions, plugin panels, validation rules, search/explorer fields, tagging, import/export transforms, and stats.
+* Reuse existing plumbing where it remains architecture-safe: FastAPI routing, services, repository abstractions, assignment lifecycle, snapshot/export pipeline, auth, telemetry, and computed-tag registry patterns.
+* Recreate the existing RAG curation flow as a compatibility pack implemented through the new extension surfaces.
+* Remove obsolete RAG-specific core code and any separate-mode branching after the redesigned architecture reaches required parity.
+
+### Out of Scope
+* Long-lived runtime "RAG mode" versus "agentic mode" product branching.
+* Preserving the current RAG-specific editor tree as a parallel first-class experience.
+* One-off support for every historical data shape in the core schema; non-core shapes belong in adapters or plugins.
+* Permanent transitional read paths for legacy RAG payloads after the redesigned system normalizes them into the generic core contract.
+* Recreating every legacy RAG-specific UI affordance or stricter workflow detail when it is not needed for the essential curator flow in v1.
+* Integrating the optional rules engine into the core workflow in v1; it remains a separate tool/tab decision until the generic plugin model and RAG compatibility pack are proven.
+* Infrastructure redesign under `infra/`.
+
+### Assumptions
+* There are no current production users depending on the existing setup, so replacement and deletion are preferable to long-term coexistence.
+* The wireframe is the primary source of truth for the target user workflow and information architecture.
+* `gt_schema_v5_generic.py` is the primary source of truth for the generic core record contract.
+* Existing storage, assignment, and export plumbing can be reused if reshaped behind generic domain contracts.
+* The redesigned core may store domain-specific data in `plugins` and other flexible schema surfaces, provided core rendering and validation remain generic by default.
+
+### Constraints
+* Backend layering must remain `api/v1 -> services -> adapters`.
+* Frontend data-fetching must continue to flow through `frontend/src/api/` or `frontend/src/services/`.
+* Extension points must be explicit enough that RAG compatibility does not require reviving a parallel app or hidden mode branches.
+* Plugin failures must surface explicitly; the system must not silently fall back to success-shaped behavior.
+* The redesign should maximize deletion of obsolete code, not preserve it for comfort.
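The constraint that plugin failures must surface explicitly can be illustrated with a small startup check. This is a minimal sketch, not the repository's implementation: `validate_plugin_packs`, `PluginContractError`, the `name`/`contributes` pack shape, and the extension-point identifiers are all hypothetical names chosen for illustration.

```python
from typing import Dict, List

# Extension points named in this PRD; the identifiers themselves are illustrative.
KNOWN_EXTENSION_POINTS = {
    "field_components", "workspace_panels", "approval_rules", "explorer_columns",
    "computed_tags", "import_export_transforms", "stats_cards",
}


class PluginContractError(Exception):
    """Raised at startup when any plugin pack violates the contract."""


def validate_plugin_packs(packs: List[Dict]) -> List[str]:
    """Return the names of the registered packs, or raise listing every problem."""
    problems: List[str] = []
    names: List[str] = []
    for index, pack in enumerate(packs):
        name = pack.get("name")
        if not name:
            problems.append(f"pack #{index}: missing 'name'")
            continue
        if name in names:
            problems.append(f"{name}: registered more than once")
        names.append(name)
        contributes = pack.get("contributes", {})
        if not contributes:
            problems.append(f"{name}: contributes nothing")
        for point in contributes:
            if point not in KNOWN_EXTENSION_POINTS:
                problems.append(f"{name}: unknown extension point '{point}'")
    if problems:
        # Fail fast: no partially wired state, and all problems are reported together.
        raise PluginContractError("; ".join(problems))
    return names
```

Collecting every problem before raising keeps the startup error actionable: one failed boot reports all misconfigured packs instead of one per restart.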
+
+## 5. Product Overview
+### Value Proposition
+Ground Truth Curator becomes a generic curation platform for agentic evaluation data. The host product natively understands conversations, tool calls, context entries, feedback, metadata, tags, provenance, and plugin payloads. Domain workflows such as RAG are layered on top through plugin packs rather than hard-coded into the core.
+
+### Differentiators (Optional)
+* **Agentic-first, not mode-added** - The product is designed around the target workflow instead of bolting it onto a legacy flow.
+* **Schema-first generic core** - Flexible fields remain flexible, with sane default renderers and targeted overrides.
+* **Deletion-friendly redesign** - The architecture is intentionally chosen to enable removal of obsolete RAG-first code while risk is low.
+* **Compatibility through plugins** - Existing RAG behavior is preserved as a proof that the new architecture is truly generic.
+
+### UX / UI (Conditional)
+The target UX follows `agent-curation-wireframe-v2.2.html` and includes:
+* A queue sidebar for assignment, selection, and item status.
+* A split-pane workspace with conversation/context editing on the left and evidence/trace panels on the right.
+* Tool calls as a first-class review artifact with expandable details and required/optional/not-needed decisions.
+* Generic panels for trace data, metadata, feedback, tags, comments, and plugin-provided content.
+* Responsive behavior that collapses evidence into a mobile drawer.
+* Item lifecycle actions including save, approve, skip, delete, restore, and duplicate.
+
+Core extensibility model:
+* Default components exist for every flexible schema field: `feedback`, `metadata`, `plugins[].data`, `tracePayload`, `contextEntries[].value` (`ContextEntry.value`), and `toolCalls[].response` (`ToolCallRecord.response`).
+* Each of those six surfaces must also be explicitly extensible, so that a plugin pack can override the default component for a shape it knows.
+* Plugin packs can contribute custom field components, side panels, approval rules, explorer columns, derived summaries, import/export transforms, computed tags, and stats cards.
+* `plugins[].data` and `toolCalls[].response` must support action-capable review components, not just passive renderers.
+* Retrieval-oriented workflows such as RAG treat retrieval as a specialized tool-call review experience. Retrieval adapters normalize backend-specific search results into a common candidate-reference contract while preserving raw backend responses for fallback rendering and auditability.
+* Selected primary and bonus references persist as plugin-owned data keyed to the retrieval tool call that produced them, and plugin-derived item summaries feed queue, explorer, and approval views.
+* The RAG compatibility pack must use these same surfaces to recreate reference-focused curation without reinstating a RAG-first host architecture.
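The resolve-with-fallback behavior in the extensibility model above can be sketched as follows. The real registry would live in the frontend; Python is used here only to keep the logic compact. `FieldComponentRegistry`, `register`, `resolve`, and `default_component` are hypothetical names, and a "component" is reduced to a function that returns a string.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Tuple

# A "component" is modeled as a function from a payload to rendered text.
Component = Callable[[object], str]

# The six flexible surfaces named in the extensibility model.
SURFACES = (
    "feedback", "metadata", "plugins[].data",
    "tracePayload", "contextEntries[].value", "toolCalls[].response",
)


def default_component(payload: object) -> str:
    """Generic fallback: show the raw payload rather than hiding it."""
    return f"default:{payload!r}"


@dataclass
class FieldComponentRegistry:
    _overrides: Dict[Tuple[str, str], Component] = field(default_factory=dict)

    def register(self, surface: str, discriminator: str, component: Component) -> None:
        if surface not in SURFACES:
            # Reject registrations that could never resolve.
            raise ValueError(f"unknown surface: {surface}")
        self._overrides[(surface, discriminator)] = component

    def resolve(self, surface: str, discriminator: Optional[str]) -> Component:
        if surface not in SURFACES:
            raise ValueError(f"unknown surface: {surface}")
        if discriminator is None:
            return default_component
        # Unknown discriminators fall back to the documented default.
        return self._overrides.get((surface, discriminator), default_component)


registry = FieldComponentRegistry()
registry.register("toolCalls[].response", "search", lambda p: f"retrieval-review:{p!r}")
```

Here a tool named `search` gets a custom review component, while any other tool response still renders through the default, so no payload disappears because an override is missing.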
+
+## 6. Functional Requirements
+| FR ID | Title | Description | Goals | Personas | Priority | Acceptance | Notes |
+|-------|-------|-------------|-------|----------|----------|-----------|-------|
+| FR-001 | Agentic-first core architecture | The product shall provide one generic curation architecture rather than separate RAG and agentic modes. Core application wiring shall load a generic editor/explorer/workspace and register optional plugin packs against documented extension points. | G-001, G-002, G-006 | Agentic Curator, Plugin Author | P0 | No permanent runtime mode split exists in the final design; core startup loads one generic architecture with plugin registration | Temporary migration toggles may exist during implementation but must not survive launch |
+| FR-002 | Generic ground-truth domain model | The backend and frontend shall adopt the generic schema in `gt_schema_v5_generic.py` as the core item contract, including `history`, `contextEntries`, `toolCalls`, `expectedTools`, `feedback`, `metadata`, `plugins`, `comment`, `provenance`, and `tracePayload`. | G-001, G-002 | Agentic Curator, ML Engineer | P0 | CRUD, serialization, editing, and export operate on the generic schema with the documented camelCase aliases | Core must not reintroduce RAG-only required fields |
+| FR-003 | Reusable plumbing preservation | Existing routing, service orchestration, repository abstractions, assignment lifecycle, auth, telemetry, snapshot/export pipeline, and computed-tag infrastructure shall be reused where compatible with the redesigned core. | G-005 | Curation Lead, ML Engineer | P1 | Architecture review shows plumbing is retained behind new generic contracts instead of rewritten from scratch | Reuse plumbing, not RAG-first semantics |
+| FR-004 | Wireframe-aligned workspace shell | The frontend shall implement the primary curation shell described in the wireframe, including queue/sidebar, split-pane layout, draggable gutter, evidence area, item actions, and mobile evidence drawer. | G-001 | Agentic Curator | P0 | Workspace behavior and panel hierarchy match the wireframe's primary interaction model | The wireframe is the UX source of truth |
+| FR-005 | Queue and selection workflow | Curators shall view, refresh, select, and request/assign items from a queue that shows item ID, status, category/tag hints, and conversation preview. | G-001 | Agentic Curator, Curation Lead | P0 | Queue supports item selection and status awareness without opening a second application mode | |
+| FR-006 | Conversation display and editing | The workspace shall display conversation history with flexible role strings and permit editing of applicable agent/user turns inline. | G-001, G-002 | Agentic Curator | P0 | Conversation renders arbitrary role labels; edited content persists on save; empty messages are blocked by validation | Role names are not limited to a fixed enum |
+| FR-007 | Tool calls as first-class review objects | Tool calls shall render as an ordered, expandable review surface with sequence metadata, grouping support for parallel calls, tool/subagent identity, arguments, responses, and decision controls. | G-001, G-002 | Agentic Curator | P0 | Tool calls are visible without plugin code; expanded details show arguments and responses; parallel group information is preserved | Derived from wireframe and generic schema |
+| FR-008 | Tool necessity decisions | Curators shall mark tool calls as required, optional, or not needed via the `expectedTools` model. The default state is allowed/optional, and approval shall require at least one required tool unless a plugin explicitly overrides that rule for a workflow. | G-001, G-002 | Agentic Curator | P0 | Decisions persist to `expectedTools`; overlap validation is enforced; approval is blocked when no required tool exists in core agentic flow | Aligns with `AGENTIC_REQUIREMENTS.md` |
+| FR-009 | Context entry editing | Curators shall add, edit, and remove `contextEntries` as key/value pairs using generic editing controls that support primitive and structured values. | G-001, G-002 | Agentic Curator | P1 | Context entries round-trip without losing type information | |
+| FR-010 | Generic evidence and detail panels | The workspace shall expose core panels for feedback, metadata, trace payload, tags, curator comments, and plugin data, with default renderers that display unknown shapes in full instead of hiding or dropping them. | G-001, G-002 | Agentic Curator | P0 | Unknown data shapes still render meaningfully through default components; no field is hidden because a custom renderer is missing | Default renderers may use key/value or JSON tree views |
+| FR-011 | Field component registry | The frontend shall provide a field component registry that resolves default and plugin-contributed components for flexible fields by discriminator (for example plugin kind, feedback source, metadata signature, tool name, or context key) and falls back to defaults when no custom component is registered. The registry shall support passive renderers and action-capable review components. | G-002, G-003 | Plugin Author | P0 | Registry supports register and resolve; unknown discriminators use documented defaults; action-capable components can be registered for `plugins[].data` and `toolCalls[].response` without modifying core component code | |
+| FR-012 | Plugin pack contribution model | Plugin packs shall be able to contribute field component overrides, action-capable review components, supplemental workspace panels, retrieval review experiences, explorer columns/filters, derived item summaries, validation rules, computed tags, import/export transforms, and metrics cards through documented extension points. | G-002, G-003, G-006 | Plugin Author, Curation Lead | P0 | A plugin pack can add domain behavior without forking the core app, reviving a mode branch, or flattening per-tool-call workflow data into host-specific fields | This is the primary extensibility contract |
+| FR-013 | Core approval workflow | Core approval shall validate generic integrity rules for agentic items, including valid history content, non-empty edited fields, and tool decision completeness. Plugin packs may add domain-specific approval gates on top. | G-001, G-002 | Agentic Curator | P0 | Generic approval rules run for all items; plugin rules can block approval with explicit messages; approval behavior is deterministic and testable | RAG-specific reference gates move to the RAG pack |
+| FR-014 | RAG compatibility pack | The redesigned platform shall include a RAG compatibility pack that reproduces the essential existing RAG-focused curation flow through plugin surfaces. This includes retrieval-specific search integration, curator selection of primary and bonus references, persisted plugin-owned reference data, reference-centric validation, multi-turn reference attachment where applicable, and any additional explorer/approval semantics needed for RAG curation. Richer legacy RAG-only affordances that are not required for the curator's essential workflow may be simplified in v1. | G-003 | Agentic Curator, Plugin Author, ML Engineer | P0 | Existing RAG flow is achievable on the new platform without a dedicated app mode or second editor tree, and RAG reference state persists without reintroducing RAG-only core fields | This requirement proves the generic architecture is real |
+| FR-015 | Compatibility data adapters | The platform shall support plugin-provided import/export or adapter logic so workflows such as RAG can project their domain-specific shapes into the generic core contract and back out to the required snapshot shape. Legacy RAG payload compatibility shall be handled at those boundaries rather than through a permanent transitional read path in the redesigned core. | G-003, G-005 | ML Engineer, Plugin Author | P1 | RAG compatibility pack can ingest and export its workflow data without changing the generic core schema or requiring a shipped dual-read compatibility layer | Adapter ownership belongs outside the core data contract |
+| FR-016 | Explorer extensibility | The explorer/list view shall support a generic baseline of columns and filters plus plugin-contributed columns, derived fields, and filter controls. | G-001, G-003 | Curation Lead | P1 | Core explorer works for generic items; plugin pack adds workflow-specific filters/columns without core forks | |
+| FR-017 | Tagging extensibility | The system shall support manual tags and registry-driven computed tags in the generic core, and plugin packs may contribute domain-specific tag providers or glossaries. | G-005, G-006 | Curation Lead, Plugin Author | P1 | Tags remain visible and editable through one shared pattern; plugin-provided computed tags appear without custom core branches | Reuse current registry pattern where practical |
+| FR-018 | Metrics and stats extensibility | The stats experience shall provide generic operational metrics plus plugin-contributed workflow metrics. | G-005, G-006 | Curation Lead | P1 | Core metrics render for all datasets; plugin-specific cards appear through the extension model | |
+| FR-019 | Lifecycle actions and concurrency | Save draft, approve, skip, delete, restore, duplicate, assignment, and ETag-based concurrency controls shall continue to operate on the redesigned core items. | G-001, G-005 | Agentic Curator, Curation Lead | P0 | Lifecycle operations work with the generic schema and preserve current concurrency expectations | Reuse existing assignment/concurrency plumbing where possible |
+| FR-020 | Legacy code retirement | Once the generic core and RAG compatibility pack meet acceptance, the team shall remove obsolete RAG-first editor stacks, separate-mode branches, and no-longer-used validation paths. | G-004 | Engineering Team | P1 | Final architecture contains one core editor stack and documented plugin packs rather than parallel host implementations | Deletion is part of the requirement, not optional cleanup |
+| FR-021 | Plugin contract documentation | The project shall document how to build a plugin pack, what extension points are available, and what guarantees the core provides. | G-003, G-006 | Plugin Author | P1 | Engineers can implement a new plugin pack using repository documentation and tests rather than reverse-engineering the core | Documentation may live alongside existing architecture/docs surfaces |
+| FR-022 | Explicit startup validation | Plugin registration and core-plugin contract validation shall run at startup, and invalid plugins shall fail explicitly rather than being silently ignored. | G-005, G-006 | Plugin Author, Curation Lead | P1 | Misconfigured plugin packs surface actionable startup errors; the system does not run in a partially wired state without notice | Protects correctness of a plugin-based architecture |
+| FR-023 | Feedback component surface | The feedback surface shall expose a documented default component for unknown feedback shapes and allow plugin overrides keyed by feedback source or discriminator. | G-002, G-006 | Agentic Curator, Plugin Author | P1 | Unknown feedback payloads remain readable by default; known sources can register custom components without host changes | |
+| FR-024 | Metadata component surface | The metadata surface shall expose a documented default component for arbitrary metadata structures and allow plugin overrides keyed by metadata signature or discriminator. | G-002, G-006 | Agentic Curator, Plugin Author | P1 | Flat and nested metadata render through documented defaults; custom components can override known shapes without host changes | |
+| FR-025 | Plugin payload component surface | Each `plugins[].data` payload shall render through a documented default component and may be upgraded to an action-capable review component keyed by plugin kind when the workflow requires editing or guided curation behavior. | G-002, G-003, G-006 | Agentic Curator, Plugin Author | P0 | Unknown plugin payloads remain readable by default; plugin-specific review components can edit and persist plugin-owned data without host schema changes | |
+| FR-026 | Trace payload component surface | The trace payload surface shall expose a documented default component for arbitrary trace payloads and allow plugin overrides for known trace formats or workflow-specific evidence views. | G-001, G-002 | Agentic Curator, Plugin Author | P1 | Arbitrary trace payloads remain inspectable by default; known formats can register richer components without host forks | |
+| FR-027 | Context value component surface | Each `contextEntries[].value` shall render through a documented default component based on value shape and may be overridden by plugin components keyed by context entry type or key pattern. | G-001, G-002 | Agentic Curator, Plugin Author | P1 | Primitive and structured context values render and round-trip through defaults; plugins can provide richer components for known contexts | |
+| FR-028 | Tool response component surface | Each `toolCalls[].response` shall render through a documented default component and may be upgraded to an action-capable review component keyed by tool identity when a workflow requires structured review or editing actions. | G-001, G-002, G-003 | Agentic Curator, Plugin Author | P0 | Unknown tool responses remain inspectable by default; tool-specific review components can add structured review behavior without host forks | |
+| FR-029 | Retrieval tool-call specialization | The RAG compatibility pack shall treat retrieval as a specialized `toolCalls[].response` review experience. Retrieval review components shall allow curators to inspect normalized candidates, select primary and bonus references, and persist those decisions as plugin-owned data keyed to the retrieval tool call instance. | G-003 | Agentic Curator, Plugin Author, ML Engineer | P0 | Retrieval remains a tool-call-centered workflow; selected references are stored per retrieval call rather than flattened into host-level RAG fields | |
+| FR-030 | Derived retrieval summaries | Plugin packs shall be able to derive item-level summaries from per-tool-call retrieval data for queue, explorer, and approval flows without flattening the stored per-call data model. | G-003, G-005, G-006 | Curation Lead, Plugin Author | P1 | Queue/explorer/approval can consume derived signals such as selected-reference counts or unresolved retrieval steps while canonical storage remains per retrieval call | |
+| FR-031 | Retrieval candidate contract | Retrieval adapters shall normalize different search engine outputs into a common candidate-reference contract for retrieval review while preserving the raw backend response for fallback rendering, auditability, and export logic. | G-003, G-005 | ML Engineer, Plugin Author | P0 | At least one retrieval review component can operate across multiple backend adapters through the common candidate contract, and the raw backend payload remains available | |
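FR-008 and FR-013 together describe a layered approval check: generic core rules first, then plugin-contributed gates. The sketch below assumes `expectedTools` holds three non-overlapping lists of tool names; that shape, the `Item` class, and the `rag_reference_gate` rule with its `primaryReferences` key are hypothetical and are not taken from `gt_schema_v5_generic.py`. The plugin override of the "at least one required tool" rule is left out for brevity.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Sequence


@dataclass
class ExpectedTools:
    # Assumed shape: three lists of tool names that must not overlap.
    required: List[str] = field(default_factory=list)
    optional: List[str] = field(default_factory=list)
    not_needed: List[str] = field(default_factory=list)


@dataclass
class Item:
    history: List[Dict[str, str]]
    expected_tools: ExpectedTools
    plugins: Dict[str, dict] = field(default_factory=dict)


ApprovalRule = Callable[[Item], List[str]]


def core_rules(item: Item) -> List[str]:
    """Generic integrity rules that run for every item."""
    errors: List[str] = []
    if any(not turn.get("content", "").strip() for turn in item.history):
        errors.append("history contains an empty message")
    tools = item.expected_tools
    required, optional, not_needed = set(tools.required), set(tools.optional), set(tools.not_needed)
    overlap = (required & optional) | (required & not_needed) | (optional & not_needed)
    if overlap:
        errors.append(f"tools with conflicting decisions: {sorted(overlap)}")
    if not required:
        errors.append("at least one tool call must be marked required")
    return errors


def approval_errors(item: Item, plugin_rules: Sequence[ApprovalRule] = ()) -> List[str]:
    """Core rules first, then plugin gates; an empty list means approval may proceed."""
    errors = core_rules(item)
    for rule in plugin_rules:
        errors.extend(rule(item))
    return errors


def rag_reference_gate(item: Item) -> List[str]:
    # Hypothetical RAG-pack rule: require a selected primary reference.
    selected = item.plugins.get("rag", {}).get("primaryReferences", [])
    return [] if selected else ["rag: select at least one primary reference"]
```

Returning messages instead of a boolean keeps the outcome deterministic and testable, and lets a plugin rule block approval with an explicit reason.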
+
+### Feature Hierarchy (Optional)
+```plain
+Ground Truth Curator
+├── Generic Core
+│   ├── Queue / Explorer / Stats
+│   ├── Workspace Shell
+│   ├── Conversation Editing
+│   ├── Tool Call Review
+│   ├── Context / Feedback / Metadata / Trace Panels
+│   ├── Assignment / Save / Approval / Export
+│   └── Field Component + Plugin Registries
+├── Plugin Packs
+│   ├── RAG Compatibility Pack
+│   │   ├── Retrieval Tool-Call Review
+│   │   ├── Candidate Reference Normalizers
+│   │   ├── Per-Call Reference Selection State
+│   │   ├── Derived Queue / Explorer / Approval Summaries
+│   │   ├── RAG Approval Rules
+│   │   ├── RAG Explorer Fields
+│   │   └── RAG Import / Export Adapters
+│   └── Future Domain Packs
+└── Shared Plumbing
+    ├── API Routes / Services / Repositories
+    ├── Auth / Telemetry / Snapshot Pipeline
+    └── Computed Tag Infrastructure
+```
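The "Candidate Reference Normalizers" and "Per-Call Reference Selection State" entries in the hierarchy correspond to FR-029 and FR-031. A minimal sketch follows, assuming two made-up backend response shapes; `CandidateReference`, both `normalize_*` functions, `select_references`, and every field name are hypothetical. The point is only that each adapter maps to one contract while keeping the raw hit, and that selections are keyed to a tool call.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Sequence


@dataclass
class CandidateReference:
    ref_id: str
    title: str
    snippet: str
    score: Optional[float]
    raw: Dict[str, Any]  # untouched backend hit, kept for fallback rendering and audit


def normalize_backend_a(response: Dict[str, Any]) -> List[CandidateReference]:
    # Hypothetical backend A: {"hits": [{"id", "title", "text", "score"}]}
    return [
        CandidateReference(str(hit["id"]), hit.get("title", ""), hit.get("text", ""), hit.get("score"), hit)
        for hit in response.get("hits", [])
    ]


def normalize_backend_b(response: Dict[str, Any]) -> List[CandidateReference]:
    # Hypothetical backend B: {"documents": [{"doc_key", "name", "excerpt"}]}, no score
    return [
        CandidateReference(str(doc["doc_key"]), doc.get("name", ""), doc.get("excerpt", ""), None, doc)
        for doc in response.get("documents", [])
    ]


def select_references(
    tool_call_id: str,
    candidates: Sequence[CandidateReference],
    primary: Sequence[str],
    bonus: Sequence[str],
) -> Dict[str, Any]:
    """Build plugin-owned selection state keyed to one retrieval tool call."""
    known = {c.ref_id for c in candidates}
    unknown = sorted((set(primary) | set(bonus)) - known)
    if unknown:
        raise ValueError(f"not among the candidates: {unknown}")
    return {"toolCallId": tool_call_id, "primary": list(primary), "bonus": list(bonus)}
```

A single review component can then operate on `CandidateReference` lists from either adapter, and item-level summaries such as selected-reference counts can be derived from the per-call records without flattening them.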
+
+## 7. Non-Functional Requirements
+| NFR ID | Category | Requirement | Metric/Target | Priority | Validation | Notes |
+|--------|----------|-------------|---------------|----------|-----------|-------|
+| NFR-001 | Performance | The workspace shall render a representative item with up to 20 tool calls and a large trace payload without visible jank on standard developer hardware | Initial interactive render <= 1s for reference dataset | P0 | Frontend performance test using representative wireframe-like data | |
+| NFR-002 | Performance | Draft save operations shall remain within current acceptable latency bounds despite larger generic payloads | Save round-trip <= 2s P95 | P0 | API/load testing | |
+| NFR-003 | Reliability | ETag concurrency control shall prevent lost updates in the redesigned core | Zero lost updates in concurrent edit tests | P0 | Concurrency integration tests | |
+| NFR-004 | Scalability | Bulk import and snapshot export shall support large generic datasets without schema-specific hacks in the core | 1000-item import/export succeeds without timeout | P1 | Batch import/export test | |
+| NFR-005 | Security | Existing auth, RBAC, and PII handling patterns shall continue to apply to the redesigned core and plugin-provided panels | No reduction in current security posture | P0 | Security review and regression tests | Plugins must not bypass core data protections |
+| NFR-006 | Accessibility | Core workspace interactions shall remain keyboard accessible and preserve readable contrast/state signaling | Keyboard navigation and contrast meet current accessibility standards | P1 | Manual accessibility review | |
+| NFR-007 | Maintainability | The final product shall ship one host editor architecture, not duplicated RAG and agentic host stacks | No permanent duplicate host stacks remain | P0 | Architecture/code review | |
+| NFR-008 | Extensibility | The RAG compatibility pack shall be implementable through documented extension points without modifying core host behavior beyond those contracts | RAG pack delivered without restoring a second host architecture | P0 | Implementation review against extension-point inventory | |
+| NFR-009 | Observability | Plugin registration, approval failures, and major lifecycle events shall emit structured telemetry consistent with existing observability conventions | Events appear in `.harness/logs.jsonl` or equivalent runtime telemetry | P1 | Harness verification / log inspection | |
+| NFR-010 | Startup correctness | Invalid or incomplete plugin wiring shall fail fast with actionable errors instead of silently degrading behavior | 100% of plugin contract failures produce explicit startup error paths | P1 | Startup validation tests | |
+| NFR-011 | Runtime isolation | Plugin-contributed field components shall run behind error boundaries with explicit fallback behavior so runtime failures degrade to safe default views rather than breaking the curation workspace | 100% of injected component failures preserve workspace operation with visible fallback state | P0 | Frontend fault-injection test / manual review | Applies to retrieval review components and other flexible-field overrides |
+| NFR-012 | Low-overhead extensibility | Optional plugin-contributed field components shall be lazy-loaded and add near-zero startup overhead when unused | Registry initialization <= 5 ms; no additional network requests until a custom component is needed | P1 | Startup performance test with default-only configuration | Prevents optional RAG surfaces from bloating the generic host |
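The lost-update guarantee in NFR-003 rests on ordinary optimistic concurrency: a write succeeds only if the caller presents the ETag it last read. The sketch below is an in-memory illustration under stated assumptions; `ItemStore`, `PreconditionFailed`, and the content-hash ETag are hypothetical, and the existing storage layer may supply its own ETag values.

```python
import hashlib
import json
from typing import Dict, Tuple


class PreconditionFailed(Exception):
    """The caller's ETag is stale; an HTTP API would typically answer 412."""


def compute_etag(doc: dict) -> str:
    # Assumption for this sketch: the ETag is a hash of the canonical JSON body.
    return hashlib.sha256(json.dumps(doc, sort_keys=True).encode()).hexdigest()


class ItemStore:
    def __init__(self) -> None:
        self._items: Dict[str, dict] = {}

    def create(self, item_id: str, doc: dict) -> str:
        self._items[item_id] = dict(doc)
        return compute_etag(doc)

    def read(self, item_id: str) -> Tuple[dict, str]:
        doc = self._items[item_id]
        return dict(doc), compute_etag(doc)

    def update(self, item_id: str, doc: dict, if_match: str) -> str:
        current = compute_etag(self._items[item_id])
        if if_match != current:
            # Another writer saved first; refuse instead of overwriting their change.
            raise PreconditionFailed(f"stale ETag for {item_id}")
        self._items[item_id] = dict(doc)
        return compute_etag(doc)
```

Two curators who read the same item receive the same ETag; whoever saves second gets `PreconditionFailed` and must reload, which is the behavior the concurrent-edit tests would assert.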
+
+## 8. Data & Analytics (Conditional)
+### Inputs
+Core inputs include generic ground-truth records matching `agentic-core/v1`, trace payloads, context entries, tool call records, manual/computed tags, and plugin-defined supplemental payloads.
+
+### Outputs / Events
+Approved datasets, snapshot exports, item lifecycle events, plugin validation outcomes, and operational metrics for queue/explorer/stats views.
+
+### Instrumentation Plan
+| Event | Trigger | Payload | Purpose | Owner |
+|-------|---------|---------|---------|-------|
+| `curation.item_saved` | User saves draft | item id, dataset, plugin pack, status, validation summary | Track editing throughput and save health | TBD |
+| `curation.item_approved` | Item approved | item id, dataset, plugin pack, tool decision count | Measure approval throughput and completeness | TBD |
+| `curation.plugin_validation_failed` | Plugin blocks action or startup | plugin name, error code, action | Detect broken plugin packs quickly | TBD |
+| `curation.renderer_fallback_used` | Default renderer used for flexible payload | field type, discriminator, plugin pack | Identify missing custom renderers | TBD |
+| `curation.snapshot_exported` | Snapshot/export succeeds | dataset, item count, plugin pack | Track downstream dataset generation | TBD |
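The instrumentation table can be read as a small event schema. As a hedged sketch, the function below writes one JSON object per line (the format suggested by `.harness/logs.jsonl` in NFR-009) and rejects events whose payload is incomplete. The event names come from the table; the snake_case payload keys, `EVENT_PAYLOADS`, and `emit` are assumptions made for illustration.

```python
import io
import json
from datetime import datetime, timezone
from typing import Any, Dict, TextIO

# Event names are from the instrumentation plan; the key spellings are assumed.
EVENT_PAYLOADS = {
    "curation.item_saved": ("item_id", "dataset", "plugin_pack", "status", "validation_summary"),
    "curation.item_approved": ("item_id", "dataset", "plugin_pack", "tool_decision_count"),
    "curation.plugin_validation_failed": ("plugin_name", "error_code", "action"),
    "curation.renderer_fallback_used": ("field_type", "discriminator", "plugin_pack"),
    "curation.snapshot_exported": ("dataset", "item_count", "plugin_pack"),
}


def emit(stream: TextIO, event: str, **payload: Any) -> Dict[str, Any]:
    """Append one structured event as a JSON line and return the record."""
    if event not in EVENT_PAYLOADS:
        raise ValueError(f"unknown event: {event}")
    missing = [key for key in EVENT_PAYLOADS[event] if key not in payload]
    if missing:
        raise ValueError(f"{event} is missing payload fields: {missing}")
    record = {"event": event, "timestamp": datetime.now(timezone.utc).isoformat(), **payload}
    stream.write(json.dumps(record, sort_keys=True) + "\n")
    return record


log = io.StringIO()  # stands in for the real log sink
emit(log, "curation.renderer_fallback_used",
     field_type="toolCalls[].response", discriminator="calculator", plugin_pack="rag")
```

Validating the payload at emit time keeps malformed events out of the log, which matters when the same log is used to detect broken plugin packs.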
+
+### Metrics & Success Criteria
+| Metric | Type | Baseline | Target | Window | Source |
+|--------|------|----------|--------|--------|--------|
+| Curator throughput | Operational | Current baseline tied to RAG-only flow | Meets or exceeds current curation throughput after redesign | TBD | App telemetry |
+| Approval success rate | Quality | No generic agentic baseline | Stable approval flow with actionable failures | TBD | App telemetry |
+| RAG compatibility coverage | Delivery | 0% on redesigned core | Existing RAG flow demonstrably runs via plugin pack | Release gate | Integration tests / manual review |
+| Legacy host code retired | Maintainability | Parallel host code exists today | Obsolete host branches removed before completion | Release gate | Architecture review |
+
+## 9. Dependencies
+| Dependency | Type | Criticality | Owner | Risk | Mitigation |
+|-----------|------|-------------|-------|------|-----------|
+| `wireframes/agent-curation-wireframe-v2.2.html` | UX reference | Critical | Product/Design | Misreading target workflow | Treat wireframe as primary UX reference and review gaps explicitly |
+| `wireframes/gt_schema_v5_generic.py` | Data contract | Critical | Engineering | Schema drift during implementation | Keep schema as core source of truth |
+| Existing API/service/repository plumbing | Internal system | High | Engineering | Reuse boundaries may be leaky | Refactor behind generic domain contracts |
+| Current computed-tag registry pattern | Internal pattern | Medium | Engineering | Registry may be too RAG-shaped | Extend or refactor registry instead of duplicating it |
+| RAG workflow knowledge | Domain reference | High | Product/Engineering | Compatibility pack may miss must-have behavior | Validate parity with current flow before legacy deletion |
+
+## 10. Risks & Mitigations
+| Risk ID | Description | Severity | Likelihood | Mitigation | Owner | Status |
+|---------|-------------|----------|------------|------------|-------|--------|
+| R-001 | The core becomes "generic" in name only and still encodes hidden RAG assumptions | High | Medium | Make RAG prove the extension model by living entirely in a compatibility pack | TBD | Open |
+| R-002 | Over-generalization slows delivery of the wireframe-defined primary flow | High | Medium | Prioritize the wireframe flow first; add only extension points needed to support RAG compatibility and clear future growth | TBD | Open |
+| R-003 | Legacy code remains indefinitely because parity is never made explicit | Medium | Medium | Make deletion a release requirement with clear acceptance criteria | TBD | Open |
+| R-004 | Large trace payloads and flexible renderers degrade editor performance | Medium | Medium | Set performance budgets, use representative data, and optimize large payload rendering paths | TBD | Open |
+| R-005 | Plugin contracts are too weak, forcing direct core modifications for compatibility packs | High | Medium | Define and test extension surfaces early, then implement the RAG pack against them | TBD | Open |
+
+## 11. Privacy, Security & Compliance
+### Data Classification
+Ground-truth items and trace payloads may contain sensitive operational or user-derived data and must retain the current repository's privacy and access controls.
+
+### PII Handling
+The redesigned core shall continue using existing backend privacy and PII handling patterns. Plugin packs and renderers must not expose raw sensitive data outside those controls.
+
+### Threat Considerations
+The plugin model increases the number of extension surfaces, so plugin registration, renderer execution, and export transforms must be constrained to trusted repository code and validated explicitly at startup.
+
+### Regulatory / Compliance (Conditional)
+| Regulation | Applicability | Action | Owner | Status |
+|-----------|--------------|--------|-------|--------|
+| Internal data handling policies | Applicable | Preserve current auth, auditing, and PII controls through redesign | TBD | Open |
+
+## 12. Operational Considerations
+| Aspect | Requirement | Notes |
+|--------|-------------|-------|
+| Deployment | Deploy one redesigned curation product with plugin registration, not separate long-lived host modes | |
+| Rollback | Roll back by reverting the redesign deployment if critical defects appear before legacy deletion is finalized | |
+| Monitoring | Reuse current telemetry and harness conventions for lifecycle and plugin events | |
+| Alerting | Surface plugin startup and approval validation failures in operational telemetry | |
+| Support | Engineering team supports core; domain owners support their plugin packs | |
+| Capacity Planning | Size performance testing around large trace payloads and tool-call-heavy items | |
+
+## 13. Rollout & Launch Plan
+### Phases / Milestones
+| Phase | Date | Gate Criteria | Owner |
+|------|------|---------------|-------|
+| Core contract and extension design | TBD | Generic domain contract, plugin interfaces, and deletion targets approved | TBD |
+| Wireframe-first workspace delivery | TBD | Primary agentic workflow matches wireframe for generic core items | TBD |
+| RAG compatibility pack | TBD | Existing RAG flow works on the new platform via plugins and adapters | TBD |
+| Legacy retirement and hardening | TBD | Obsolete host code removed; performance, telemetry, and rollout checks pass | TBD |
+
+### Feature Flags (Conditional)
+| Flag | Purpose | Default | Sunset Criteria |
+|------|---------|---------|----------------|
+| Temporary migration flags only | Support implementation sequencing while old and new code coexist briefly | Off in final release | Remove before launch of redesigned host architecture |
+
+### Communication Plan (Optional)
+Communicate clearly that this PRD supersedes the earlier separate-mode direction. Reviewers should evaluate the redesign based on whether the generic core can host both the target agentic workflow and the RAG compatibility workflow without architectural duplication.
+
+## 14. Open Questions
+| Q ID | Question | Owner | Deadline | Status |
+|------|----------|-------|----------|--------|
+| Q-001 | What is the minimal plugin surface set required for the RAG compatibility pack to achieve parity without hidden core exceptions? | TBD | TBD | Resolved |
+| Q-002 | Should legacy RAG API payloads be adapted at import/export boundaries only, or is a transitional read-path needed during implementation? | TBD | TBD | Resolved |
+| Q-003 | Which current RAG behaviors are mandatory for v1 compatibility versus acceptable to simplify while there are no active users? | TBD | TBD | Resolved |
+| Q-004 | Should the optional rules engine remain fully separate in v1, or should the plugin model reserve a standard hook for future rules-engine integration now? | TBD | TBD | Resolved |
+
+Resolved Q-001: The minimal RAG compatibility surface must include a retrieval review plugin experience that can bind domain search backends, let curators select primary and bonus references, persist those selections in plugin-owned data on the generic core, and integrate with validation, explorer, computed-tag, and import/export extension points. The generic host schema must not regain a RAG-only `refs` field.
+
+Resolved Q-002: Legacy RAG payloads should be adapted at import/export boundaries only. The redesigned system should store and read the normalized generic record, and any migration toggles used during implementation must remain temporary rather than becoming a shipped transitional read path.
+
+Resolved Q-003: v1 compatibility should preserve the curator-visible essentials of the current RAG flow: retrieval/search, primary and bonus reference selection, reference visit validation when enabled, multi-turn reference attachment where applicable, assignment/approval/export lifecycle continuity, and tags. Richer legacy RAG-only affordances such as stricter key-paragraph expectations, older grounding-summary/evaluation-specific surfaces, and similar non-essential conveniences may be simplified while there are no active users.
+
+Resolved Q-004: The optional rules engine should remain fully separate in v1. The plugin model should not reserve a dedicated standard hook for rules-engine integration until the generic plugin model and the RAG compatibility pack are proven in practice.
+
+## 15. Changelog
+| Version | Date | Author | Summary | Type |
+|---------|------|--------|---------|------|
+| 0.6 | 2026-03-11 | Copilot | Expanded extensibility requirements to explicitly cover six flexible field surfaces, action-capable tool/plugin components, retrieval-as-tool-call review, derived summaries, common retrieval candidate contracts, and runtime/performance guardrails | Update |
+| 0.5 | 2026-03-11 | Copilot | Resolved Q-004 by keeping the optional rules engine separate in v1 with no reserved standard hook yet | Update |
+| 0.4 | 2026-03-11 | Copilot | Resolved Q-003 by preserving essential RAG curator workflow while allowing simplification of non-essential legacy affordances in v1 | Update |
+| 0.3 | 2026-03-11 | Copilot | Resolved Q-002 by defining boundary-only legacy RAG payload adaptation with no permanent transitional read path | Update |
+| 0.2 | 2026-03-11 | Copilot | Resolved Q-001 by defining retrieval review as a plugin-owned RAG compatibility surface with persisted primary and bonus references | Update |
+| 0.1 | 2026-03-11 | Copilot | Created new PRD that supersedes the separate-mode direction and defines an agentic-first redesign with RAG compatibility via plugins | Draft |
+
+## 16. References & Provenance
+| Ref ID | Type | Source | Summary | Conflict Resolution |
+|--------|------|--------|---------|--------------------|
+| REF-001 | Wireframe | `wireframes/agent-curation-wireframe-v2.2.html` | Primary UX definition for the target agentic-first curation workflow | Takes precedence for workspace and interaction design |
+| REF-002 | Schema | `wireframes/gt_schema_v5_generic.py` | Primary generic data contract for the redesigned core | Takes precedence for core record shape |
+| REF-003 | Notes | `wireframes/AGENTIC_REQUIREMENTS.md` | Supporting requirements and feedback from Peter, including tool-call-first review, removal of grounding summary/evaluation criteria, and the required-tool expectation | Used to clarify behaviors that the wireframe and schema leave implicit |
+| REF-004 | Superseded PRD | `docs/prds/agentic-curation-mode.md` | Prior direction based on separate mode; retained as historical context and as the source of the earlier inventory of pluggable flexible-field surfaces | This PRD supersedes it where directions conflict |
+
+### Citation Usage
+The wireframe defines the target user workflow and visual hierarchy. The generic schema defines the core persisted record shape. `AGENTIC_REQUIREMENTS.md` clarifies supporting behavioral expectations that are not fully expressed in the schema alone. The older `agentic-curation-mode.md` is included both to preserve history and to carry forward the earlier explicit inventory of pluggable flexible-field surfaces, now reframed for the redesigned single-core architecture.
+
+## 17. Appendices (Optional)
+### Glossary
+| Term | Definition |
+|------|------------|
+| Generic core | The redesigned host architecture that understands agentic schema primitives and plugin extension points |
+| Plugin pack | A bundled set of renderers, rules, adapters, metrics, and panels that implement a domain workflow on top of the generic core |
+| RAG compatibility pack | The plugin pack that recreates the existing reference-centric RAG curation flow on the generic core |
+| Legacy host code | Current RAG-first editor/mode-specific paths targeted for retirement after parity |
+
+### Additional Notes
+This PRD intentionally treats code deletion and host-architecture simplification as product requirements. Because there are no current users of the existing setup, the redesign should optimize for the best future architecture rather than preserving the previous one.
+
+Generated 2026-03-11T04:10:52.010Z by GitHub Copilot CLI (mode: full)
+<!-- markdown-table-prettify-ignore-end -->
diff --git a/docs/prds/rag-workflow-current-state.md b/docs/prds/rag-workflow-current-state.md
new file mode 100644
index 0000000..1a69a16
--- /dev/null
+++ b/docs/prds/rag-workflow-current-state.md
@@ -0,0 +1,306 @@
+<!-- markdownlint-disable-file -->
+<!-- markdown-table-prettify-ignore-start -->
+# RAG Workflow Current State - Product Requirements Document (PRD)
+Version 1.0 | Status BASELINE | Owner TBD | Team Ground Truth Curator | Target Current Production Behavior | Lifecycle Current State
+
+## Progress Tracker
+| Phase | Done | Gaps | Updated |
+|-------|------|------|---------|
+| Context | ✅ | — | 2026-03-11 |
+| Problem & Users | ✅ | — | 2026-03-11 |
+| Scope | ✅ | — | 2026-03-11 |
+| Requirements | ✅ | Minor source conflicts noted in Open Questions | 2026-03-11 |
+| Metrics & Risks | ✅ | Live metric baselines need owner confirmation | 2026-03-11 |
+| Operationalization | ✅ | Environment-specific details remain deployment-specific | 2026-03-11 |
+| Finalization | 🔶 | Stakeholder review pending | 2026-03-11 |
+Unresolved Critical Questions: 3 | TBDs: 3
+
+## 1. Executive Summary
+### Context
+Ground Truth Curator currently operates a RAG-oriented curation workflow. Curators receive or claim work items, inspect generated question-and-answer content, search for and attach supporting references, edit content and metadata, and approve items once grounding requirements are satisfied.
+
+This document captures the **current-state** product behavior at a high level so future work can preserve proven workflow expectations, identify intentional deltas, and avoid introducing regressions while adjacent initiatives evolve.
+
+### Core Opportunity
+Provide a single baseline artifact that explains what the RAG workflow already does today across assignment, search, editing, reference management, approval, and export so product, engineering, and operations teams can align on incumbent behavior.
+
+### Goals
+| Goal ID | Statement | Type | Baseline | Target | Timeframe | Priority |
+|---------|-----------|------|----------|--------|-----------|----------|
+| G-001 | Preserve the existing assignment-based curation workflow for RAG items | Product | Behavior spread across code and specs | Single documented baseline | Current state | P0 |
+| G-002 | Preserve reference-grounded approval quality gates | Quality | Implemented in current workflow | Baseline documented and testable | Current state | P0 |
+| G-003 | Preserve support for single-turn and multi-turn curation flows where currently supported | Product | Implemented in current workflow | Baseline documented and testable | Current state | P1 |
+| G-004 | Preserve approved-item export and downstream dataset handoff behavior | Operational | Implemented in current workflow | Baseline documented and testable | Current state | P1 |
+| G-005 | Preserve optimistic concurrency and assignment ownership protections | Technical | Implemented in current workflow | Baseline documented and testable | Current state | P0 |
+
+### Objectives (Optional)
+| Objective | Key Result | Priority | Owner |
+|-----------|------------|----------|-------|
+| Baseline the workflow | One current-state PRD covers assignment, curation, references, approval, export, and constraints | P0 | TBD |
+| Reduce ambiguity | Current-state requirements can be traced to repo specs and implementation | P0 | TBD |
+| Support safe future change | Future PRDs can diff against this baseline instead of inferring behavior from scattered docs | P1 | TBD |
+
+## 2. Problem Definition
+### Current Situation
+The RAG workflow already exists in production-oriented code and repo specifications, but the behavior is described across multiple surfaces: backend APIs, frontend component flows, specs, implementation notes, and operational documentation. Teams can understand pieces of the system, but there is no single current-state PRD that explains the workflow end to end.
+
+### Problem Statement
+Without a current-state PRD, future changes risk misrepresenting incumbent behavior, weakening approval or reference-quality guarantees, or changing workflow expectations for curators and downstream dataset consumers.
+
+### Root Causes
+* Workflow expectations are distributed across backend, frontend, and spec artifacts rather than consolidated in one baseline document
+* Some repo documents mix implemented behavior with future-oriented ideas, creating ambiguity about what is live today
+* Current-state quality gates and workflow dependencies are easier to infer from code than from a single product artifact
+
+### Impact of Inaction
+* Future initiatives may accidentally regress core RAG curation behavior
+* Product and engineering teams may debate current-state behavior from incomplete evidence
+* Downstream consumers may not have a stable description of what an approved RAG item guarantees
+
+## 3. Users & Personas
+| Persona | Goals | Pain Points | Impact |
+|---------|-------|------------|--------|
+| **Curator / SME** | Review assigned items efficiently, edit content, add grounded references, and approve high-quality RAG entries | Workflow rules live across UI behavior and validation logic; approval expectations can be misunderstood | Primary user of the curation workflow |
+| **Curation Lead** | Monitor work queues, throughput, and data quality expectations | Needs clarity on assignment, approval, and export semantics | Operational owner of workflow quality |
+| **ML / Evaluation Engineer** | Consume approved snapshots for downstream evaluation and benchmarking | Needs confidence that approved items satisfy consistent grounding rules | Downstream consumer of approved datasets |
+| **Platform Engineer** | Maintain APIs, storage, and deployment behavior without breaking workflow contracts | Current-state expectations are spread across code and docs | Maintains the workflow implementation |
+
+### Journeys (Optional)
+1. A curator requests or receives assignments from the queue.
+2. The curator opens an item in the editing workspace.
+3. The curator reviews question/answer content and, where supported, multi-turn history.
+4. The curator searches for references, attaches relevant sources, visits them, and records key excerpts.
+5. The curator updates tags or other metadata, saves changes, and resolves validation gaps.
+6. The curator approves, skips, deletes, restores, or exports items according to workflow state.
+
+## 4. Scope
+### In Scope
+* Current-state RAG workflow from assignment through approval and export
+* Current user-facing workflow expectations in queue, editor, and reference-management surfaces
+* Current backend expectations for ownership, optimistic concurrency, and snapshot export
+* Current data and provenance behaviors that influence approval quality and downstream consumption
+* Current non-functional expectations directly coupled to workflow integrity
+
+### Out of Scope (justify if empty)
+* New future-state feature ideation not supported by current repo evidence
+* Agentic-mode workflow requirements
+* Infrastructure redesign beyond what is necessary to describe current workflow dependencies
+* Implementation-level code structure details that do not materially change workflow behavior
+
+### Assumptions
+* The current workflow remains centered on RAG ground-truth curation rather than agentic trace review
+* Assignments and approved exports remain the primary operating model for curator throughput
+* The present system continues to rely on optimistic concurrency and assignment ownership for safe writes
+* Optional integrations such as Azure AI Search may vary by environment, while the baseline workflow remains reference-grounded
+
+### Constraints
+* Approval behavior must preserve current reference-quality gates
+* Assignment mutations must preserve ownership expectations and concurrency protections
+* Approved-item export must remain consistent enough for downstream consumers to ingest snapshots
+* Current-state requirements should describe what the system already does, not prescribe speculative redesign
+
+## 5. Product Overview
+### Value Proposition
+The current RAG workflow gives curators a structured way to convert candidate items into approved, reference-grounded ground truth. It combines queue management, editor tooling, reference search and annotation, validation, and export into a single curation loop intended to maintain quality and provenance.
+
+### Differentiators (Optional)
+* Assignment-based work acquisition rather than reliance on unmanaged item browsing alone
+* Reference-grounded approval gates that require evidence, not only content edits
+* Support for both baseline Q/A editing and broader conversation-history handling where the current implementation supports it
+* Snapshot export that turns approved curation results into downstream-consumable artifacts
+
+### UX / UI (Conditional)
+The current workflow is organized around three major user-facing areas:
+
+* **Queue / assignment surface** for requesting, browsing, and opening work items
+* **Curation editor** for editing item content, metadata, and status
+* **Reference management surface** for searching, selecting, visiting, and annotating grounding sources
+
+UX Status: Implemented current-state workflow, with known documentation ambiguities around some advanced search behaviors
+
+## 6. Functional Requirements
+| FR ID | Title | Description | Goals | Personas | Priority | Acceptance | Notes |
+|-------|-------|------------|-------|----------|----------|-----------|-------|
+| FR-001 | Self-serve assignment queue | The system shall support an assignment-based workflow where a curator can request items and receive work from a queue rather than relying only on freeform browsing. | G-001 | Curator / SME; Curation Lead | P0 | A curator can obtain assigned work items through the current assignment flow and view their assigned queue. | Source baseline: assignment workflow specs and assignment API/service surfaces |
+| FR-002 | Assignment ownership protection | The system shall preserve ownership rules for assigned draft work so one curator cannot silently overwrite another curator's active assignment. | G-005 | Curator / SME; Platform Engineer | P0 | Write attempts that violate assignment ownership are rejected with stable error handling. | Current-state behavior is coupled to assignment mutations and approval/update flows |
+| FR-003 | Curation workspace | The system shall provide a curation workspace that lets a curator inspect and edit the current item's answer content, related metadata, and workflow status. | G-001 | Curator / SME | P0 | A curator can open an assigned item, make edits, and save or transition workflow state. | Baseline from current frontend curation surfaces |
+| FR-004 | Multi-turn support where available | The system shall preserve current support for both traditional question/answer items and conversation-history editing where the current implementation supports multi-turn behavior. | G-003 | Curator / SME; ML / Evaluation Engineer | P1 | Current supported item shapes remain editable without breaking single-turn behavior. | Current repo evidence shows compatibility for multi-turn flows while preserving legacy shapes |
+| FR-005 | Reference search and selection | The system shall support searching for references and attaching selected references to the item under curation. | G-002 | Curator / SME | P0 | A curator can search for candidate references and add them to the curated item. | Search capability may vary by backend/provider, but the workflow assumes reference search is available |
+| FR-006 | Reference visitation and key excerpts | The system shall track whether selected references were visited and capture key supporting excerpts that justify approval. | G-002 | Curator / SME; ML / Evaluation Engineer | P0 | Selected references can be visited and annotated with supporting excerpt content before approval. | Current workflow emphasizes visited state and minimum excerpt quality |
+| FR-007 | Approval gating by reference completeness | The system shall prevent approval unless the item satisfies current grounding rules, including having at least one selected reference and meeting current visitation and excerpt completeness requirements. | G-002 | Curator / SME; Curation Lead | P0 | Items that do not satisfy current reference gates cannot be approved. | This is a defining quality invariant of the current workflow |
+| FR-008 | Tag and metadata management | The system shall support curator-managed metadata updates, including tags, while preserving current normalization and computed-tag behavior. | G-001 | Curator / SME; Platform Engineer | P1 | Curators can manage supported metadata and stored tag values remain normalized and stable. | Tag semantics are part of the current curation loop and downstream data quality |
+| FR-009 | Workflow state transitions | The system shall support current state transitions such as save draft, approve, skip, soft delete, and restore according to current workflow rules. | G-001; G-005 | Curator / SME; Curation Lead | P0 | A curator can perform supported workflow actions and the resulting item state is persisted consistently. | Current workflow includes soft-delete and restore rather than destructive removal |
+| FR-010 | Snapshot export | The system shall support exporting approved items as snapshots suitable for downstream dataset consumption. | G-004 | ML / Evaluation Engineer; Curation Lead | P1 | Approved data can be exported using the current snapshot workflow and results are consumable downstream. | Current-state export supports attachment-oriented and artifact-oriented patterns |
+| FR-011 | Explorer and filtering | The system shall allow users to browse existing items using the current set of supported filters, including status, dataset, tags, and other implemented search constraints. | G-001 | Curator / SME; Curation Lead | P1 | Users can narrow visible items with currently supported filters without changing assignment semantics. | Some advanced filtering ideas remain future-oriented or partially documented |
+| FR-012 | Concurrency-safe updates | The system shall preserve optimistic concurrency on item updates so users do not unknowingly overwrite newer changes. | G-005 | Curator / SME; Platform Engineer | P0 | Updates require the latest concurrency token and return a conflict on stale writes. | Current state uses ETags as the primary concurrency contract |
+
+### Feature Hierarchy (Optional)
+```plain
+RAG Workflow
+|- Assignment and queue management
+|  |- Self-serve assignment
+|  |- Assigned queue view
+|  |- Ownership enforcement
+|- Curation workspace
+|  |- Item editing
+|  |- Multi-turn compatibility
+|  |- Tags and metadata
+|  |- State transitions
+|- Reference workflow
+|  |- Search
+|  |- Selection
+|  |- Visit tracking
+|  |- Key excerpt capture
+|  |- Approval gating
+|- Export and downstream handoff
+   |- Snapshot export
+   |- Approved item consumption
+```
+
+## 7. Non-Functional Requirements
+| NFR ID | Category | Requirement | Metric/Target | Priority | Validation | Notes |
+|--------|----------|------------|--------------|----------|-----------|-------|
+| NFR-001 | Reliability | The system shall expose a healthy backend service state suitable for operational checks before curation begins. | Health endpoint responds successfully in healthy environments | P1 | Operational smoke checks | Current backend baseline includes health checking |
+| NFR-002 | Concurrency | The system shall use optimistic concurrency controls on mutable workflows. | Stale writes are rejected rather than silently accepted | P0 | API conflict tests and manual verification | Directly protects curator work from accidental overwrite |
+| NFR-003 | Usability | Approval rules shall be surfaced through user-facing validation behavior rather than only through post-submit failures. | Curators can identify unmet approval conditions before approval succeeds | P0 | UI validation and approval-path testing | Current workflow relies on explicit approval gates |
+| NFR-004 | Data Integrity | Tag storage and comparable metadata shall remain normalized and deterministic. | Stable normalized values across repeated save/read cycles | P1 | Unit tests and data round-trip checks | Supports downstream consistency |
+| NFR-005 | Provenance | Approved items shall retain reference provenance sufficient to explain grounding decisions. | Approved items preserve selected references and supporting excerpts | P0 | Export review and API payload inspection | Central to RAG quality expectations |
+| NFR-006 | Compatibility | The system shall accept supported input naming conventions while preserving stable output contracts for clients. | Current API payload contract remains interoperable with existing clients | P1 | API integration tests | Current backend behavior accepts variant casing and emits stable wire output |
+| NFR-007 | Security | The system shall preserve assignment ownership semantics and current user attribution behavior. | Mutations remain attributable to the acting user and ownership checks stay enforced | P1 | Assignment mutation tests | Production auth may vary by environment; dev simulation remains supported |
+| NFR-008 | Observability | The system should preserve safe-by-default observability behavior so telemetry is opt-in and non-blocking where configured. | Workflow remains usable with telemetry disabled or absent | P2 | Environment smoke tests | Current frontend observability pattern is intentionally safe by default |
+
+Categories: Performance, Reliability, Scalability, Security, Privacy, Accessibility, Observability, Maintainability, Localization (if applicable), Compliance (if applicable).
+
+## 8. Data & Analytics (Conditional)
+### Inputs
+* Candidate RAG ground-truth items entering the curation workflow
+* Reference search queries and selected reference metadata
+* Curator edits to answers, history, tags, and workflow status
+* User identity or simulated user headers used for assignment attribution
+
+### Outputs / Events
+* Updated item state, assignment state, and review metadata
+* Approved snapshots for downstream dataset use
+* Optional telemetry or operational traces when enabled
+
+### Instrumentation Plan
+| Event | Trigger | Payload | Purpose | Owner |
+|-------|---------|--------|---------|-------|
+| Assignment requested | Curator requests work | User, count, dataset context | Measure queue usage | TBD |
+| Item updated | Curator saves changes | Item id, user, status, concurrency outcome | Measure edit/save behavior | TBD |
+| Approval attempted | Curator attempts approval | Item id, validation outcome | Measure quality-gate friction | TBD |
+| Snapshot exported | User exports approved set | Dataset, mode, count | Measure downstream handoff | TBD |
+
+### Metrics & Success Criteria
+| Metric | Type | Baseline | Target | Window | Source |
+|--------|------|----------|--------|--------|--------|
+| Approval success rate after validation | Workflow quality | TBD | Monitor current state | Rolling | App/API telemetry or logs |
+| Average assignment-to-approval time | Operational | TBD | Monitor current state | Rolling | Assignment and review timestamps |
+| Reference completeness failure rate | Quality | TBD | Monitor current state | Rolling | Approval validation results |
+| Export success rate | Operational | TBD | Monitor current state | Rolling | Export logs or API results |
+
+## 9. Dependencies
+| Dependency | Type | Criticality | Owner | Risk | Mitigation |
+|-----------|------|------------|-------|------|-----------|
+| Repository-backed assignment and item storage | Data platform | High | TBD | Workflow breaks if reads/writes fail | Preserve repository abstractions and operational checks |
+| Reference search capability/provider | Search | High | TBD | Curators cannot attach supporting evidence efficiently | Keep baseline provider behavior or an acceptable fallback |
+| Frontend curation workspace | Product surface | High | TBD | Curators cannot edit or approve items | Preserve core editor flows and integration tests |
+| Snapshot export path | Downstream integration | Medium | TBD | Approved data cannot be handed off reliably | Preserve export contract and smoke-test it |
+| Identity / user attribution | Security / operations | Medium | TBD | Assignment ownership and auditing weaken | Preserve current auth or simulation mechanisms per environment |
+
+## 10. Risks & Mitigations
+| Risk ID | Description | Severity | Likelihood | Mitigation | Owner | Status |
+|---------|-------------|---------|-----------|-----------|-------|--------|
+| R-001 | Future changes weaken approval grounding rules | High | Medium | Treat FR-006 and FR-007 as regression-sensitive requirements | TBD | Open |
+| R-002 | Search-related docs and implementation drift further apart | Medium | Medium | Use this PRD as the baseline and resolve documented open questions | TBD | Open |
+| R-003 | Workflow changes bypass optimistic concurrency or ownership checks | High | Low | Preserve ETag and ownership tests on all write paths | TBD | Open |
+| R-004 | Export changes break downstream consumers silently | Medium | Medium | Preserve snapshot contract and verify against representative consumers | TBD | Open |
+
+## 11. Privacy, Security & Compliance
+### Data Classification
+The workflow manages curated dataset content, references, tags, and user-attribution metadata. Exact data classification depends on the dataset and deployment environment.
+
+### PII Handling
+Current repo evidence shows support for user attribution and PII-related service surfaces. This document records that existing behavior as part of the baseline and does not define additional PII handling requirements.
+
+### Threat Considerations
+* Unauthorized or conflicting writes must be prevented through ownership and concurrency checks
+* Evidence and export data should preserve integrity during curation and handoff
+* Environment-specific authentication must not undermine assignment semantics
+
+### Regulatory / Compliance (Conditional)
+| Regulation | Applicability | Action | Owner | Status |
+|-----------|--------------|--------|-------|--------|
+| TBD | Environment-specific | Confirm per deployment | TBD | Open |
+
+## 12. Operational Considerations
+| Aspect | Requirement | Notes |
+|--------|------------|-------|
+| Deployment | The workflow shall remain deployable in local and hosted environments supported by the existing repo | Current repo supports local development plus Azure-oriented deployment patterns |
+| Rollback | Changes affecting assignment, approval, or export should be rollbackable without data-loss surprises | Preserve current contracts before rolling out workflow changes |
+| Monitoring | Operators should be able to confirm service health and core workflow readiness | Health and optional telemetry patterns already exist |
+| Alerting | Critical workflow failures should surface through existing operational channels | Environment-specific implementation may vary |
+| Support | Support teams need a clear baseline of current-state behavior when triaging workflow issues | This PRD is intended to be that baseline |
+| Capacity Planning | Capacity depends on dataset size, search provider behavior, and curator throughput | Current-state PRD does not redefine platform sizing |
+
+## 13. Rollout & Launch Plan
+### Phases / Milestones
+| Phase | Date | Gate Criteria | Owner |
+|-------|------|--------------|-------|
+| Current-state baseline authored | 2026-03-11 | PRD created and source-backed | Copilot / TBD |
+| Stakeholder review | TBD | Product and engineering confirm baseline accuracy | TBD |
+| Baseline adoption | TBD | Future PRDs reference this document for delta analysis | TBD |
+
+### Feature Flags (Conditional)
+| Flag | Purpose | Default | Sunset Criteria |
+|------|---------|--------|----------------|
+| Demo / mock provider configuration | Support non-production workflows where enabled | Environment-specific | Remains as long as demo workflows are supported |
+| Optional telemetry configuration | Enable observability without making it mandatory | Disabled unless configured | Remains environment-specific |
+
+### Communication Plan (Optional)
+Share this baseline PRD with product, frontend, backend, and operations stakeholders before approving workflow changes that touch assignment, reference validation, or export behavior.
+
+## 14. Open Questions
+| Q ID | Question | Owner | Deadline | Status |
+|------|----------|-------|---------|--------|
+| Q-001 | Should reference search be treated as universally required current-state capability, or is provider-backed search optional in some environments? | TBD | TBD | Open |
+| Q-002 | Which advanced filter behaviors are truly current state versus future-oriented specs? | TBD | TBD | Open |
+| Q-003 | Are dataset-level curation instructions part of the baseline workflow everywhere or only in specific flows? | TBD | TBD | Open |
+
+## 15. Changelog
+| Version | Date | Author | Summary | Type |
+|---------|------|-------|---------|------|
+| 1.0 | 2026-03-11 | Copilot | Created current-state baseline PRD for the RAG workflow from repo evidence | Added |
+
+## 16. References & Provenance
+| Ref ID | Type | Source | Summary | Conflict Resolution |
+|--------|------|--------|---------|--------------------|
+| REF-001 | Spec | `specs/assignment-workflow.md` | Assignment-centric curation workflow and self-serve queue behavior | Used as assignment baseline |
+| REF-002 | Spec | `specs/explorer-view.md` | Explorer and filter expectations for current browsing workflow | Open question retained for advanced filters |
+| REF-003 | Spec | `specs/curation-editor.md` | Editor expectations for curation actions and item editing | Used as curation workspace baseline |
+| REF-004 | Spec | `specs/reference-management.md` | Reference search, selection, visitation, and excerpt behavior | Used as evidence workflow baseline |
+| REF-005 | Spec | `specs/export-snapshots.md` | Snapshot export patterns and downstream handoff behavior | Used as export baseline |
+| REF-006 | Spec | `specs/data-persistence.md` | Data, persistence, and workflow write expectations | Used as storage/concurrency support |
+| REF-007 | Codebase guide | `frontend/CODEBASE.md` | Frontend workflow surfaces, validation cues, and user interactions | Used when summarizing UX baseline |
+| REF-008 | Codebase guide | `backend/CODEBASE.md` | Backend API and data contract expectations | Used for API and platform baseline |
+| REF-009 | API/service implementation | `backend/app/api/v1/assignments.py`, `backend/app/api/v1/ground_truths.py`, `backend/app/services/assignment_service.py` | Concrete assignment, state transition, and workflow behavior | Used to anchor current-state implementation claims |
+| REF-010 | Research note | `.copilot-tracking/research/20260121-high-level-requirements-research.md` | Prior repo-backed requirements synthesis | Used as supporting consolidation evidence |
+
+### Citation Usage
+This PRD intentionally describes **current-state** behavior only. Where repo artifacts disagree or mix implemented and planned behavior, the more conservative interpretation was used and the ambiguity was captured in Open Questions rather than resolved by speculation.
+
+## 17. Appendices (Optional)
+### Glossary
+| Term | Definition |
+|------|-----------|
+| RAG | Retrieval-Augmented Generation workflow centered on grounded references |
+| Curator / SME | User responsible for reviewing, editing, and approving ground-truth items |
+| Reference | Supporting external source attached to an item to justify grounding |
+| Snapshot export | Exported representation of approved items for downstream consumption |
+| Optimistic concurrency | Update safety model that rejects stale writes instead of silently overwriting |
+
+### Additional Notes
+This document is intentionally a **baseline PRD** for incumbent behavior. It should be used to compare future changes, not as evidence that every adjacent idea in repo docs is currently implemented.
+
+Generated 2026-03-11T12:34:07Z by Copilot (mode: full)
+<!-- markdown-table-prettify-ignore-end -->
diff --git a/frontend/.env.example b/frontend/.env.example
index 32b2fd3..dc638a4 100644
--- a/frontend/.env.example
+++ b/frontend/.env.example
@@ -2,6 +2,9 @@
 # Base URL of the backend API
 VITE_API_BASE_URL=http://localhost:8000
 
+# Optional app base path when hosted under a virtual directory (for example /gtc/)
+VITE_APP_BASE_PATH=
+
 # Application Branding
 # Title displayed in the header (default: Ground Truth Curator)
 VITE_APP_TITLE=Ground Truth Curator
diff --git a/frontend/.github/copilot-instructions.md b/frontend/.github/copilot-instructions.md
index 6d81821..fa4b6a2 100644
--- a/frontend/.github/copilot-instructions.md
+++ b/frontend/.github/copilot-instructions.md
@@ -1 +1,8 @@
-Use the show-notification tool from notify MCP when you're done working.
\ No newline at end of file
+# Frontend Copilot Instructions
+
+- Stack: Vite 7, React 19, TypeScript, Tailwind CSS v4, Biome, Vitest, and `openapi-fetch`.
+- For React performance, rendering, bundle, and client data-flow work, consult `.github/skills/react-best-practices/SKILL.md` and `.github/skills/react-best-practices/APPLICABILITY.md`.
+- Apply general React and Vite guidance. Do not copy Next.js-only patterns such as `next/dynamic`, API routes, server actions, `after()`, or RSC/server-only optimizations into this frontend.
+- Treat SWR guidance as reference-only unless SWR is intentionally added to the frontend dependencies.
+- Keep HTTP calls in `src/api/` or `src/services/`, not presentational components.
+- Validate frontend changes with `npm run lint:check` and `npm run typecheck`; for behavior changes, also run `npm run test:run -- --pool=threads --poolOptions.threads.singleThread`.
diff --git a/frontend/CODEBASE.md b/frontend/CODEBASE.md
index 908d687..0581a46 100644
--- a/frontend/CODEBASE.md
+++ b/frontend/CODEBASE.md
@@ -76,21 +76,22 @@ Key concepts:
 - Provider abstraction: `Provider` with list/get/save/export. `ApiProvider` implements REST calls and ETag concurrency.
 - DEMO mode: `src/config/demo.ts` computes a boolean from `DEMO_MODE`/`VITE_DEMO_MODE`; when true, the app uses `JsonProvider` and mocks for search/LLM/stats.
 - Fingerprints: `itemVersionFingerprint` (content only) and `itemStateFingerprint` (content + status/deleted) drive idempotency and version bumping. Tags are part of content.
-- Approval constraints: `canApproveCandidate(item)` requires at least one selected reference AND that `refsApprovalReady(item)` passes (all refs visited; selected refs have ≥40 char key paragraph). Deleted items cannot be approved.
-- UX separation: Left queue, center editor (Question/Answer + actions), right references pane (Search vs Selected tabs), stats view, and modal overlays.
+- Approval constraints: generic multi-turn approval is conversation- and expected-tools-driven; reference visit/key-paragraph rules apply only when an active compatibility or plugin workflow opts into them. Deleted items cannot be approved.
+- UX separation: Left queue, center editor (conversation + actions), right evidence/review host, stats view, and modal overlays.
 
 ## Data Models (from `src/models/groundTruth.ts`)
 
-- Reference: { id, title?, url, snippet?, visitedAt?, keyParagraph?, selected? }
+- Reference: { id, title?, url, snippet?, visitedAt?, keyParagraph?, messageIndex?, turnId?, toolCallId? }. Treat `turnId` as the stable ownership contract; `messageIndex` remains a migration fallback only.
 - GroundTruthItem: {
-  id, question, answer,
-  references: Reference[],
+  id,
+  history?: ConversationTurn[] with stable `turnId` / optional `stepId`,
+  references: plugin-owned retrieval state projected as Reference[],
   status: "draft" | "approved" | "skipped",
   providerId, version, deleted?,
   tags?: string[],
   curationInstructions?: string
 }
-- Change category: previously required when Q/A changed; no longer enforced. A legacy `ChangeCategorySelector` component exists but is not wired into save logic.
+- Canonical editing state is `history[]`; display helpers like `getLastUserTurn()`, `getLastAgentTurn()`, and `getQueuePreview()` extract values from turns on-demand.
 
 ## Provider Contract (from `src/models/provider.ts`)
 
@@ -117,7 +118,7 @@ Providers:
 ## Validation (from `src/models/validators.ts` + `gtHelpers.ts`)
 
 - `refsApprovalReady(item)`
-  - If references exist: all must be visited; selected references must have keyParagraph length >= 40
+  - Transitional helper for RAG-compat workflows only. Do not treat it as the generic approval contract.
   - If no references: this primitive returns true, but is combined with:
 - `canApproveCandidate(item)` – requires at least one selected reference, `refsApprovalReady(item)` to pass, and item not deleted
 
@@ -129,7 +130,7 @@ High-level state:
 - selectedId: current item id; `current`: editable clone
 - qaChanged: computed against baseline
 - viewMode: "curate" (default) vs "questions" (list-level delete/restore) vs "stats"
-- right panel tab: "search" vs "selected"
+- right evidence panel state: attached-evidence review surface with optional plugin-provided search capabilities
 - search state: query, results, selection set, searching flag
 - ref opening: marks visited and opens in a new tab (in-app iframe preview is removed)
 - saving and lastSavedStateFp: no-op and double-click idempotency
@@ -138,9 +139,9 @@ High-level state:
 Key flows:
 1) Load + select first item – provider initialized on mount; list loaded; selection deep-cloned; QA baseline and search reset; saved fingerprint captured.
 2) Edit Question/Answer – textareas bound to `current`; `qaChanged` reflects differences; change category is NOT required anymore.
-3) References – right panel
-  - Search tab: performs backend `searchReferences` (or `mockAiSearch` in demo), displays results, supports multi-select Add and individual Add; disabled when URL already present; de-dup by URL.
-  - Selected tab: lists references with selection toggle, visit/open (sets `visitedAt`), key paragraph with counter; Remove supports Undo (8s window).
+3) Evidence review surfaces
+   - Plugin-provided search surface: performs backend `searchReferences` (or `mockAiSearch` in demo) for workflows that expose host-owned retrieval acquisition via plugin extensions.
+   - Attached evidence review: lists retrieved or plugin-projected references, supports visit/open, key paragraph editing, and Remove with Undo (8s window).
 4) Generate Answer – opens modal listing selected references; on confirm, calls backend `callAgentChat`. UI currently applies the full answer at once.
 5) Save – computes state fingerprint; if unchanged: returns "No changes". If approving, validates `canApproveCandidate`. On success, updates `items`, `current`, and lastSavedStateFp. Status-only saves do not bump version. Content changes (Q/A/refs/tags) bump version (self-tests run in dev).
 6) Export – triggers backend snapshot download via `groundTruths.downloadSnapshot()`; no in-app JSON modal.
@@ -155,13 +156,13 @@ Self-tests: `runSelfTests()` asserts validator rules and provider bump rules in
   - Inputs: items, selectedId, onSelect, onRefresh, onSelfServe
   - Shows id, status badge, version, and question (truncated); highlights deleted
 - ReferencesTabs
-  - Props split for SearchTab and SelectedTab; handles tab switch and counts selected refs
+  - Generic evidence/review host that shows plugin-provided search capabilities plus attached evidence review
 - SearchTab
   - Inputs: query, results, selection set, existingReferences, callbacks
-  - Buttons: Search, Add, open in new tab, sticky "Add N to Selected"
+  - Used only when the current workflow still exposes host-owned retrieval acquisition
 - SelectedTab
   - Inputs: references and callbacks (update/remove/open)
-  - Shows visit status, selection toggle, key paragraph with counter; Remove triggers confirmation and Undo toast
+  - Shows attached evidence from plugin-owned retrieval state or per-call candidates
 - TagsEditor
   - Inputs: selected tags; allows add/remove
 - InstructionsPane
@@ -198,9 +199,9 @@ Self-tests: `runSelfTests()` asserts validator rules and provider bump rules in
   - Extend `models/validators.ts` and/or `gtHelpers.ts`
   - Enforce in `useGroundTruth.save` before invoking provider
 
-4) Add a new tab to the right panel
-  - Extend `ReferencesTabs.tsx` with a new discriminated union value and conditional render
-  - Keep props isolated per tab to avoid cross-tab dependencies
+4) Add a new evidence/review surface
+   - Prefer contributing plugin-owned panels through the registry/TracePanel path before expanding the shared host pane
+   - Plugin-provided search surfaces are isolated as optional extensions; the core UI works without them
 
 5) Export mechanics
   - UI uses backend snapshot download (`groundTruths.downloadSnapshot`). If you need in-app preview, reintroduce an `ExportModal` backed by `provider.export()` and/or the snapshot payload.
@@ -211,7 +212,7 @@ Self-tests: `runSelfTests()` asserts validator rules and provider bump rules in
 ## Gotchas and Invariants
 
 - Version rules: only content changes bump version. Content includes Q/A, references, and tags. Do not bump for status-only or no-op saves. Fingerprints enforce this—self-tests run in development.
-- Approval gating: requires at least one selected reference; if references exist, all must be visited; selected refs require ≥ 40 chars key paragraph; deleted items cannot be approved.
+- Approval gating: generic approval follows conversation + expected-tools rules. Reference completeness is a compat/plugin gate, not the universal host contract. Deleted items cannot be approved.
 - Undo delete window: 8 seconds via toast action; ensure timers cleared on unmount in `useToasts`.
 - Deep clone on selection: ensures edits don’t mutate provider list until Save.
 - De-dup by URL when adding references from search.
@@ -234,9 +235,17 @@ Error modes
 
 Success criteria
 - No-op or status-only saves do not bump version
-- Approval only allowed when validator passes
+- Approval only allowed when the active generic or plugin-specific validator passes
 - Undo for ref deletion works within 8 seconds
 
+## Migration Completed
+
+The frontend has completed migration to canonical multi-turn state:
+- Editing state is `history[]` with stable `turnId` identity.
+- Registry-driven rendering and plugin-owned evidence surfaces are in production.
+- Display helpers extract user/agent messages from turns on-demand.
+- API compatibility projections live only in the `adapters/apiMapper.ts` boundary layer.
+
 ## Running and Verifying
 
 - Start: `npm run dev`
@@ -246,7 +255,7 @@ Success criteria
 
 Manual smoke test
 - Load app, verify first item selected
-- Toggle references visited and add key paragraphs; attempt Approve gating (requires ≥1 selected ref)
+- Toggle evidence visited state and key paragraphs on a compatibility item; attempt Approve gating and confirm plugin-specific rules only apply where expected
 - Run Search, add results, ensure de-dup by URL
 - Generate Answer populates answer text
 - Save Draft vs Approve follows version bump rules
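Review note on the approval-gating change in `frontend/CODEBASE.md`: the split between a generic host gate and opt-in compat/plugin gates can be sketched as below. This is a minimal illustration only. Every name here (`CurationItem`, `genericGate`, `ragCompatGate`, `canApprove`) is hypothetical and not taken from the repository, the generic rule (one non-empty user turn and one non-empty agent turn) is an assumption because the diff does not spell out the conversation and expected-tools rules, and the compat gate is a simplified reading of the legacy "all refs visited, key paragraph of at least 40 characters" rule.

```typescript
// Hypothetical types for illustration; not the repository's models.
type Turn = { turnId: string; role: "user" | "agent"; content: string };
type EvidenceRef = { id: string; url: string; visitedAt?: string; keyParagraph?: string };
type CurationItem = { id: string; history?: Turn[]; references: EvidenceRef[]; deleted?: boolean };
type Gate = (item: CurationItem) => boolean;

const hasTurn = (item: CurationItem, role: Turn["role"]): boolean =>
  (item.history ?? []).some((t) => t.role === role && t.content.trim().length > 0);

// Assumed generic gate: the conversation has a non-empty user turn and agent turn.
const genericGate: Gate = (item) => hasTurn(item, "user") && hasTurn(item, "agent");

// Simplified legacy-style gate: at least one reference, all visited,
// each with a key paragraph of 40 or more characters.
const MIN_KEY_PARAGRAPH = 40;
const ragCompatGate: Gate = (item) =>
  item.references.length > 0 &&
  item.references.every(
    (r) => Boolean(r.visitedAt) && (r.keyParagraph ?? "").trim().length >= MIN_KEY_PARAGRAPH,
  );

// Deleted items are never approvable; plugin gates apply only when passed in.
function canApprove(item: CurationItem, pluginGates: Gate[] = []): boolean {
  if (item.deleted) return false;
  return genericGate(item) && pluginGates.every((gate) => gate(item));
}
```

Under these assumptions a generic item is approved with `canApprove(item)`, while a compatibility workflow opts in with `canApprove(item, [ragCompatGate])`, which keeps reference completeness out of the universal host contract.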
diff --git a/frontend/README.md b/frontend/README.md
index a3dbc5e..c2d30cd 100644
--- a/frontend/README.md
+++ b/frontend/README.md
@@ -24,11 +24,12 @@ Welcome. This is the React + Vite + TypeScript frontend for Ground Truth Curatio
 
 2) Configure environment
 
-   - Copy `.env.example` to `.env.local` and adjust as needed:
-     - `VITE_API_BASE_URL` – backend base URL (default `http://localhost:8000`)
-     - `VITE_OPENAPI_URL` – OpenAPI spec URL used for type generation
-     - `VITE_DEV_USER_ID` – optional dev-only user id sent as `X-User-Id`
-     - `VITE_SELF_SERVE_LIMIT` – optional default for self-serve assignments
+    - Copy `.env.example` to `.env.local` and adjust as needed:
+      - `VITE_API_BASE_URL` – backend base URL (default `http://localhost:8000`)
+      - `VITE_APP_BASE_PATH` – optional virtual directory such as `/gtc/` for non-root hosting
+      - `VITE_OPENAPI_URL` – OpenAPI spec URL used for type generation
+      - `VITE_DEV_USER_ID` – optional dev-only user id sent as `X-User-Id`
+      - `VITE_SELF_SERVE_LIMIT` – optional default for self-serve assignments
 
 3) Start the app
 
@@ -44,6 +45,7 @@ This app calls the backend under the `/v1` path. In development, Vite is configu
 
 - Change the target host by editing `VITE_API_BASE_URL` in `.env.local`.
 - Keep frontend calls relative (e.g., `fetch('/v1/ground-truths')`) so the proxy works seamlessly.
+- If the app is hosted under a virtual directory such as `/gtc/`, set `VITE_APP_BASE_PATH=/gtc/` so built asset URLs and same-origin API calls include that prefix.
 - Optionally set `VITE_DEV_USER_ID` in `.env.local` to send an `X-User-Id` header for dev-only flows.
 
 More details: `CONNECT_TO_BACKEND.md`.
@@ -79,6 +81,14 @@ The UI includes a demo flow. You can toggle via env at startup:
 
 The build injects `import.meta.env.DEMO_MODE` to match `VITE_DEMO_MODE` for client code.
 
+For the full local demo stack from the repository root, use:
+
+```bash
+VITE_DEMO_MODE=true VITE_DEV_USER_ID=demo-user make dev-up
+```
+
+`VITE_DEMO_MODE=true` enables the demo UI behavior, while `VITE_DEV_USER_ID=demo-user` gives the backend a stable dev/demo identity for assignment-backed flows.
+
 ## Telemetry (optional)
 
 Client telemetry is initialized early and can be configured via Vite env vars:
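Review note on `VITE_APP_BASE_PATH`: the README change says built asset URLs and same-origin API calls should include the virtual-directory prefix. A minimal sketch of that prefixing is shown below. The helper names `normalizeBasePath` and `withBasePath` are hypothetical, and how the repository actually wires the variable into Vite is not shown in this diff. Vite itself exposes the configured base to client code as `import.meta.env.BASE_URL`, which is likely the value such a helper would receive.

```typescript
// Hypothetical helper: normalize "", "gtc", "/gtc", or "/gtc/" to a path
// that always starts and ends with a single slash ("/" when empty).
function normalizeBasePath(raw: string | undefined): string {
  const trimmed = (raw ?? "").trim();
  const inner = trimmed.replace(/^\/+/, "").replace(/\/+$/, "");
  return inner === "" ? "/" : `/${inner}/`;
}

// Hypothetical helper: prefix a same-origin path such as "/v1/ground-truths"
// with the normalized base path, avoiding a doubled slash at the join.
function withBasePath(basePath: string | undefined, path: string): string {
  return normalizeBasePath(basePath) + path.replace(/^\/+/, "");
}
```

With an empty base path the request path is unchanged, so root-hosted deployments would behave as before under this sketch.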
diff --git a/frontend/biome.json b/frontend/biome.json
index 1213e9d..db3a5bb 100644
--- a/frontend/biome.json
+++ b/frontend/biome.json
@@ -1,5 +1,5 @@
 {
-	"$schema": "https://biomejs.dev/schemas/2.2.0/schema.json",
+	"$schema": "https://biomejs.dev/schemas/2.4.6/schema.json",
 	"formatter": {
 		"enabled": true
 	},
@@ -22,7 +22,10 @@
 			"**",
 			"!**/dist",
 			"!**/*.generated.js",
-			"!src/api/generated.ts"
+			"!src/api/generated.ts",
+			"!**/node_modules",
+			"!**/coverage",
+			"!**/build"
 		]
 	},
 	"vcs": {
diff --git a/frontend/docs/REFACTORING_PLAN.md b/frontend/docs/REFACTORING_PLAN.md
index 6d46148..f850a95 100644
--- a/frontend/docs/REFACTORING_PLAN.md
+++ b/frontend/docs/REFACTORING_PLAN.md
@@ -1,5 +1,7 @@
 # Frontend Refactoring Plan — Ground Truth Curator
 
+> Migration note: this document now describes the generic host plus plugin-owned evidence direction. Older single-turn and global references-tab assumptions are kept only when they still describe an active compatibility path.
+
 This plan outlines how to refactor the current 1-file demo (`src/demo.tsx`) into cohesive, testable modules using the already-established patterns (components in `components/*`, hooks in `hooks/*`, models in `models/*`, services in `services/*`). The goal is to preserve behavior while improving structure, maintainability, and testability.
 
 ## Goals
@@ -17,7 +19,7 @@ This plan outlines how to refactor the current 1-file demo (`src/demo.tsx`) into
 ## Current State (high-level)
 - Single container component: `src/demo.tsx` holds:
   - App header, view toggles, JSON export, in-app preview toggle.
-  - Left `QueueSidebar` (already extracted), center QA editor, right references tabs (already extracted as `ReferencesTabs`).
+  - Left `QueueSidebar` (already extracted), center multi-turn editor, right evidence/review host (`ReferencesTabs` only when the workflow still uses the shared compatibility surface).
   - Modals already extracted: `GenerateAnswerModal`, `ExportModal`.
   - Overlay already extracted: `ReferenceViewer`.
   - Toast system via `useToasts` and `Toasts` component.
@@ -35,12 +37,8 @@ This plan outlines how to refactor the current 1-file demo (`src/demo.tsx`) into
   - `hooks/useGroundTruth.ts` (or context provider `GroundTruthProvider` if we want React Context). Start with a hook; promote to context only if needed.
 - UI composed of small components:
   - HeaderBar (view toggles, preview toggle, export)
-  - Editor panel broken into:
-    - `QuestionEditor`
-    - `AnswerEditor`
-    - `ChangeCategorySelector`
-    - `SaveControls`
-  - References panel stays under `components/app/ReferencesPanel/*` and receives update/remove/open callbacks from the hook.
+  - Editor panel broken into multi-turn conversation editing plus save controls
+  - Evidence/review host under `components/app/pages/ReferencesSection.tsx`, with `components/app/ReferencesPanel/*` reserved for the shared compatibility surface
   - Explorer view list extracted to `QuestionsList`.
 - Data/Provider boundary
   - Keep `JsonProvider` as the backing implementation.
@@ -71,7 +69,7 @@ This plan outlines how to refactor the current 1-file demo (`src/demo.tsx`) into
 
 4) Tests
 - Add `frontend/src/__tests__/provider.spec.ts` covering versioning bump rules (already asserted in `runSelfTests`).
-- Add `frontend/src/__tests__/validators.spec.ts` covering `refsApprovalReady` edge cases.
+- Add validator coverage for generic conversation/expected-tools approval first; keep `refsApprovalReady` coverage only for RAG-compat migration paths.
 - Add minimal component smoke tests where practical.
 
 5) Cleanup
@@ -93,8 +91,8 @@ From `src/demo.tsx` → new modules:
 
 - Search integration
   - `runSearch`, `addRefsFromResults`, `toggleSelectSearchResult`, `addSelectedFromResults` → Split:
-    - Search execution stays in hook (`runSearch`),
-    - Selection state for search results can live in `ReferencesTabs` (UI-local) or in the hook if shared; prefer local to the right panel.
+    - Search execution stays in hook (`runSearch`) for the remaining compatibility surface
+    - Plugin-owned retrieval acquisition should stay out of the shared right-pane host and surface through plugin panels instead.
 
 - Generation
   - `onGenerate`, `doGenerateApply` → Expose `generateDraftAnswer` action in `useGroundTruth` (implementation calls `services/llm` mocks). Modal visibility remains in the view layer.
@@ -115,7 +113,8 @@ From `src/demo.tsx` → new modules:
   - Implementation details
     - Holds `providerRef` and memoizes.
     - Encapsulates `itemStateFingerprint` logic.
-    - Validates via `refsApprovalReady` and QA change category rules.
+    - Treats `history[]` with stable `turnId` / `stepId` as canonical editing state, with `question` / `answer` retained only as migration projections.
+    - Uses generic approval rules first, with reference-specific validation isolated to compat/plugin behavior.
 
 - `src/components/app/HeaderBar.tsx`
   - Props: sidebarOpen/toggle, viewMode/toggle, inAppPreview/toggle, onExport
@@ -136,17 +135,24 @@ From `src/demo.tsx` → new modules:
   - Props: canApprove, saving, isDeleted, onSaveDraft, onApprove, onDelete, onRestore
 
 - Keep existing:
-  - `QueueSidebar`, `ReferencesTabs`, `GenerateAnswerModal`, `ExportModal`, `ReferenceViewer`, `Toasts`
+  - `QueueSidebar`, `ReferencesTabs` (compatibility surface), `GenerateAnswerModal`, `ExportModal`, `ReferenceViewer`, `Toasts`
 
 ## Validation and Versioning Rules (kept)
 - Status-only saves do not bump version.
 - Content changes (question/answer/references) bump version.
-- Approval requires all refs visited, and selected refs must have a key paragraph ≥ 40 chars.
+- Approval is generic by default; reference visit/key-paragraph rules are a compat/plugin concern while legacy RAG workflows still exist.
+
+## Phase 1 Migration Inventory
+
+- Keep: `useGroundTruth` as the shared editing boundary, but make history the canonical host contract.
+- Rewrite: any remaining guidance that treats top-level question/answer or refs-only approval as the permanent architecture.
+- Narrow: mapping/provider tests that still hard-code legacy single-turn conversion.
+- Delete with shim: temporary references-only guidance once the legacy adapter path is removed.
 
 ## Testing Plan
 - Unit tests using Vitest (or your current test runner):
   - Provider/versioning: replicate `runSelfTests` into tests and remove from runtime.
-  - Validators: `refsApprovalReady` cases.
+  - Validators: generic approval rules first, then `refsApprovalReady` cases for the shrinking compat surface.
   - Hook: small tests around `itemStateFingerprint` behavior and save gating (optional if time-bound).
 
 ## Rollout Strategy
@@ -165,7 +171,7 @@ From `src/demo.tsx` → new modules:
   - Mitigation: keep selection local to `ReferencesTabs` and pass chosen refs back up.
 
 ## Acceptance Criteria (DoD)
-- Behavior remains unchanged (manual smoke test: save, approve gating, delete/restore, add/remove refs, generate draft, export JSON, in-app preview toggling).
+- Behavior remains aligned with the generic host and active compatibility flows (manual smoke test: save, generic approve gating, delete/restore, add/remove evidence on compat items, generate draft, export JSON, in-app preview toggling).
 - `demo.tsx` becomes a thin composition root; core logic moved to `hooks/useGroundTruth.ts` and child components.
 - Self-tests moved into unit tests; no console.assert in production code.
 - Lint/build pass without new warnings.
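Review note on the versioning rules kept by this plan: "status-only saves do not bump version, content changes do" can be expressed with two fingerprints, one over content and one over content plus status/deleted. The sketch below is illustrative only. The `Versioned` shape and the functions `contentFingerprint`, `stateFingerprint`, and `applySave` are hypothetical stand-ins, not the repository's `itemVersionFingerprint` / `itemStateFingerprint` implementations, and treating tags as order-insensitive is an assumption.

```typescript
// Hypothetical item shape for illustration.
type Versioned = {
  history: { turnId: string; content: string }[];
  references: { id: string; url: string }[];
  tags: string[];
  status: "draft" | "approved" | "skipped";
  deleted: boolean;
  version: number;
};

// Content-only fingerprint: turns, references, and tags (tags sorted by assumption).
function contentFingerprint(item: Versioned): string {
  return JSON.stringify([
    item.history.map((t) => [t.turnId, t.content]),
    item.references.map((r) => [r.id, r.url]),
    [...item.tags].sort(),
  ]);
}

// State fingerprint: content plus status and deleted flag.
function stateFingerprint(item: Versioned): string {
  return JSON.stringify([contentFingerprint(item), item.status, item.deleted]);
}

type SaveOutcome = "no-op" | "status-only" | "content";

// Decide the save outcome and resulting version from the last saved item and the edited one.
function applySave(saved: Versioned, edited: Versioned): { item: Versioned; outcome: SaveOutcome } {
  if (stateFingerprint(saved) === stateFingerprint(edited)) {
    return { item: saved, outcome: "no-op" };
  }
  if (contentFingerprint(saved) === contentFingerprint(edited)) {
    return { item: { ...edited, version: saved.version }, outcome: "status-only" };
  }
  return { item: { ...edited, version: saved.version + 1 }, outcome: "content" };
}
```

Comparing fingerprints rather than individual fields keeps the no-op and double-click idempotency check to a single string comparison.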
diff --git a/frontend/index.html b/frontend/index.html
index b20a845..59bbbea 100644
--- a/frontend/index.html
+++ b/frontend/index.html
@@ -2,7 +2,7 @@
 <html lang="en">
   <head>
     <meta charset="UTF-8" />
-    <link rel="icon" type="image/svg+xml" href="/favicon.svg" />
+    <link rel="icon" type="image/svg+xml" href="%BASE_URL%favicon.svg" />
     <meta name="viewport" content="width=device-width, initial-scale=1.0" />
     <meta name="theme-color" content="#111827" />
     <title>Ground Truth Curator</title>
@@ -10,6 +10,6 @@
   <body>
     <div id="root"></div>
     <div id="modal-root"></div>
-    <script type="module" src="/src/main.tsx"></script>
+    <script type="module" src="%BASE_URL%src/main.tsx"></script>
   </body>
 </html>
diff --git a/frontend/package-lock.json b/frontend/package-lock.json
index 159119d..772bb74 100644
--- a/frontend/package-lock.json
+++ b/frontend/package-lock.json
@@ -21,10 +21,12 @@
 				"react": "^19.1.1",
 				"react-dom": "^19.1.1",
 				"react-markdown": "^9.0.3",
+				"react-resizable-panels": "^4.7.2",
 				"remark-gfm": "^4.0.0"
 			},
 			"devDependencies": {
-				"@biomejs/biome": "2.2.0",
+				"@biomejs/biome": "2.4.6",
+				"@playwright/test": "^1.58.2",
 				"@testing-library/jest-dom": "^6.6.3",
 				"@testing-library/react": "^16.1.0",
 				"@types/node": "^24.3.0",
@@ -374,9 +376,9 @@
 			}
 		},
 		"node_modules/@biomejs/biome": {
-			"version": "2.2.0",
-			"resolved": "https://registry.npmjs.org/@biomejs/biome/-/biome-2.2.0.tgz",
-			"integrity": "sha512-3On3RSYLsX+n9KnoSgfoYlckYBoU6VRM22cw1gB4Y0OuUVSYd/O/2saOJMrA4HFfA1Ff0eacOvMN1yAAvHtzIw==",
+			"version": "2.4.6",
+			"resolved": "https://registry.npmjs.org/@biomejs/biome/-/biome-2.4.6.tgz",
+			"integrity": "sha512-QnHe81PMslpy3mnpL8DnO2M4S4ZnYPkjlGCLWBZT/3R9M6b5daArWMMtEfP52/n174RKnwRIf3oT8+wc9ihSfQ==",
 			"dev": true,
 			"license": "MIT OR Apache-2.0",
 			"bin": {
@@ -390,20 +392,20 @@
 				"url": "https://opencollective.com/biome"
 			},
 			"optionalDependencies": {
-				"@biomejs/cli-darwin-arm64": "2.2.0",
-				"@biomejs/cli-darwin-x64": "2.2.0",
-				"@biomejs/cli-linux-arm64": "2.2.0",
-				"@biomejs/cli-linux-arm64-musl": "2.2.0",
-				"@biomejs/cli-linux-x64": "2.2.0",
-				"@biomejs/cli-linux-x64-musl": "2.2.0",
-				"@biomejs/cli-win32-arm64": "2.2.0",
-				"@biomejs/cli-win32-x64": "2.2.0"
+				"@biomejs/cli-darwin-arm64": "2.4.6",
+				"@biomejs/cli-darwin-x64": "2.4.6",
+				"@biomejs/cli-linux-arm64": "2.4.6",
+				"@biomejs/cli-linux-arm64-musl": "2.4.6",
+				"@biomejs/cli-linux-x64": "2.4.6",
+				"@biomejs/cli-linux-x64-musl": "2.4.6",
+				"@biomejs/cli-win32-arm64": "2.4.6",
+				"@biomejs/cli-win32-x64": "2.4.6"
 			}
 		},
 		"node_modules/@biomejs/cli-darwin-arm64": {
-			"version": "2.2.0",
-			"resolved": "https://registry.npmjs.org/@biomejs/cli-darwin-arm64/-/cli-darwin-arm64-2.2.0.tgz",
-			"integrity": "sha512-zKbwUUh+9uFmWfS8IFxmVD6XwqFcENjZvEyfOxHs1epjdH3wyyMQG80FGDsmauPwS2r5kXdEM0v/+dTIA9FXAg==",
+			"version": "2.4.6",
+			"resolved": "https://registry.npmjs.org/@biomejs/cli-darwin-arm64/-/cli-darwin-arm64-2.4.6.tgz",
+			"integrity": "sha512-NW18GSyxr+8sJIqgoGwVp5Zqm4SALH4b4gftIA0n62PTuBs6G2tHlwNAOj0Vq0KKSs7Sf88VjjmHh0O36EnzrQ==",
 			"cpu": [
 				"arm64"
 			],
@@ -418,9 +420,9 @@
 			}
 		},
 		"node_modules/@biomejs/cli-darwin-x64": {
-			"version": "2.2.0",
-			"resolved": "https://registry.npmjs.org/@biomejs/cli-darwin-x64/-/cli-darwin-x64-2.2.0.tgz",
-			"integrity": "sha512-+OmT4dsX2eTfhD5crUOPw3RPhaR+SKVspvGVmSdZ9y9O/AgL8pla6T4hOn1q+VAFBHuHhsdxDRJgFCSC7RaMOw==",
+			"version": "2.4.6",
+			"resolved": "https://registry.npmjs.org/@biomejs/cli-darwin-x64/-/cli-darwin-x64-2.4.6.tgz",
+			"integrity": "sha512-4uiE/9tuI7cnjtY9b07RgS7gGyYOAfIAGeVJWEfeCnAarOAS7qVmuRyX6d7JTKw28/mt+rUzMasYeZ+0R/U1Mw==",
 			"cpu": [
 				"x64"
 			],
@@ -435,9 +437,9 @@
 			}
 		},
 		"node_modules/@biomejs/cli-linux-arm64": {
-			"version": "2.2.0",
-			"resolved": "https://registry.npmjs.org/@biomejs/cli-linux-arm64/-/cli-linux-arm64-2.2.0.tgz",
-			"integrity": "sha512-6eoRdF2yW5FnW9Lpeivh7Mayhq0KDdaDMYOJnH9aT02KuSIX5V1HmWJCQQPwIQbhDh68Zrcpl8inRlTEan0SXw==",
+			"version": "2.4.6",
+			"resolved": "https://registry.npmjs.org/@biomejs/cli-linux-arm64/-/cli-linux-arm64-2.4.6.tgz",
+			"integrity": "sha512-kMLaI7OF5GN1Q8Doymjro1P8rVEoy7BKQALNz6fiR8IC1WKduoNyteBtJlHT7ASIL0Cx2jR6VUOBIbcB1B8pew==",
 			"cpu": [
 				"arm64"
 			],
@@ -452,9 +454,9 @@
 			}
 		},
 		"node_modules/@biomejs/cli-linux-arm64-musl": {
-			"version": "2.2.0",
-			"resolved": "https://registry.npmjs.org/@biomejs/cli-linux-arm64-musl/-/cli-linux-arm64-musl-2.2.0.tgz",
-			"integrity": "sha512-egKpOa+4FL9YO+SMUMLUvf543cprjevNc3CAgDNFLcjknuNMcZ0GLJYa3EGTCR2xIkIUJDVneBV3O9OcIlCEZQ==",
+			"version": "2.4.6",
+			"resolved": "https://registry.npmjs.org/@biomejs/cli-linux-arm64-musl/-/cli-linux-arm64-musl-2.4.6.tgz",
+			"integrity": "sha512-F/JdB7eN22txiTqHM5KhIVt0jVkzZwVYrdTR1O3Y4auBOQcXxHK4dxULf4z43QyZI5tsnQJrRBHZy7wwtL+B3A==",
 			"cpu": [
 				"arm64"
 			],
@@ -469,9 +471,9 @@
 			}
 		},
 		"node_modules/@biomejs/cli-linux-x64": {
-			"version": "2.2.0",
-			"resolved": "https://registry.npmjs.org/@biomejs/cli-linux-x64/-/cli-linux-x64-2.2.0.tgz",
-			"integrity": "sha512-5UmQx/OZAfJfi25zAnAGHUMuOd+LOsliIt119x2soA2gLggQYrVPA+2kMUxR6Mw5M1deUF/AWWP2qpxgH7Nyfw==",
+			"version": "2.4.6",
+			"resolved": "https://registry.npmjs.org/@biomejs/cli-linux-x64/-/cli-linux-x64-2.4.6.tgz",
+			"integrity": "sha512-oHXmUFEoH8Lql1xfc3QkFLiC1hGR7qedv5eKNlC185or+o4/4HiaU7vYODAH3peRCfsuLr1g6v2fK9dFFOYdyw==",
 			"cpu": [
 				"x64"
 			],
@@ -486,9 +488,9 @@
 			}
 		},
 		"node_modules/@biomejs/cli-linux-x64-musl": {
-			"version": "2.2.0",
-			"resolved": "https://registry.npmjs.org/@biomejs/cli-linux-x64-musl/-/cli-linux-x64-musl-2.2.0.tgz",
-			"integrity": "sha512-I5J85yWwUWpgJyC1CcytNSGusu2p9HjDnOPAFG4Y515hwRD0jpR9sT9/T1cKHtuCvEQ/sBvx+6zhz9l9wEJGAg==",
+			"version": "2.4.6",
+			"resolved": "https://registry.npmjs.org/@biomejs/cli-linux-x64-musl/-/cli-linux-x64-musl-2.4.6.tgz",
+			"integrity": "sha512-C9s98IPDu7DYarjlZNuzJKTjVHN03RUnmHV5htvqsx6vEUXCDSJ59DNwjKVD5XYoSS4N+BYhq3RTBAL8X6svEg==",
 			"cpu": [
 				"x64"
 			],
@@ -503,9 +505,9 @@
 			}
 		},
 		"node_modules/@biomejs/cli-win32-arm64": {
-			"version": "2.2.0",
-			"resolved": "https://registry.npmjs.org/@biomejs/cli-win32-arm64/-/cli-win32-arm64-2.2.0.tgz",
-			"integrity": "sha512-n9a1/f2CwIDmNMNkFs+JI0ZjFnMO0jdOyGNtihgUNFnlmd84yIYY2KMTBmMV58ZlVHjgmY5Y6E1hVTnSRieggA==",
+			"version": "2.4.6",
+			"resolved": "https://registry.npmjs.org/@biomejs/cli-win32-arm64/-/cli-win32-arm64-2.4.6.tgz",
+			"integrity": "sha512-xzThn87Pf3YrOGTEODFGONmqXpTwUNxovQb72iaUOdcw8sBSY3+3WD8Hm9IhMYLnPi0n32s3L3NWU6+eSjfqFg==",
 			"cpu": [
 				"arm64"
 			],
@@ -520,9 +522,9 @@
 			}
 		},
 		"node_modules/@biomejs/cli-win32-x64": {
-			"version": "2.2.0",
-			"resolved": "https://registry.npmjs.org/@biomejs/cli-win32-x64/-/cli-win32-x64-2.2.0.tgz",
-			"integrity": "sha512-Nawu5nHjP/zPKTIryh2AavzTc/KEg4um/MxWdXW0A6P/RZOyIpa7+QSjeXwAwX/utJGaCoXRPWtF3m5U/bB3Ww==",
+			"version": "2.4.6",
+			"resolved": "https://registry.npmjs.org/@biomejs/cli-win32-x64/-/cli-win32-x64-2.4.6.tgz",
+			"integrity": "sha512-7++XhnsPlr1HDbor5amovPjOH6vsrFOCdp93iKXhFn6bcMUI6soodj3WWKfgEO6JosKU1W5n3uky3WW9RlRjTg==",
 			"cpu": [
 				"x64"
 			],
@@ -1882,6 +1884,22 @@
 				"win32"
 			]
 		},
+		"node_modules/@playwright/test": {
+			"version": "1.58.2",
+			"resolved": "https://registry.npmjs.org/@playwright/test/-/test-1.58.2.tgz",
+			"integrity": "sha512-akea+6bHYBBfA9uQqSYmlJXn61cTa+jbO87xVLCWbTqbWadRVmhxlXATaOjOgcBaWU4ePo0wB41KMFv3o35IXA==",
+			"dev": true,
+			"license": "Apache-2.0",
+			"dependencies": {
+				"playwright": "1.58.2"
+			},
+			"bin": {
+				"playwright": "cli.js"
+			},
+			"engines": {
+				"node": ">=18"
+			}
+		},
 		"node_modules/@protobufjs/aspromise": {
 			"version": "1.1.2",
 			"resolved": "https://registry.npmjs.org/@protobufjs/aspromise/-/aspromise-1.1.2.tgz",
@@ -5753,6 +5771,53 @@
 				"url": "https://github.com/sponsors/jonschlinkert"
 			}
 		},
+		"node_modules/playwright": {
+			"version": "1.58.2",
+			"resolved": "https://registry.npmjs.org/playwright/-/playwright-1.58.2.tgz",
+			"integrity": "sha512-vA30H8Nvkq/cPBnNw4Q8TWz1EJyqgpuinBcHET0YVJVFldr8JDNiU9LaWAE1KqSkRYazuaBhTpB5ZzShOezQ6A==",
+			"dev": true,
+			"license": "Apache-2.0",
+			"dependencies": {
+				"playwright-core": "1.58.2"
+			},
+			"bin": {
+				"playwright": "cli.js"
+			},
+			"engines": {
+				"node": ">=18"
+			},
+			"optionalDependencies": {
+				"fsevents": "2.3.2"
+			}
+		},
+		"node_modules/playwright-core": {
+			"version": "1.58.2",
+			"resolved": "https://registry.npmjs.org/playwright-core/-/playwright-core-1.58.2.tgz",
+			"integrity": "sha512-yZkEtftgwS8CsfYo7nm0KE8jsvm6i/PTgVtB8DL726wNf6H2IMsDuxCpJj59KDaxCtSnrWan2AeDqM7JBaultg==",
+			"dev": true,
+			"license": "Apache-2.0",
+			"bin": {
+				"playwright-core": "cli.js"
+			},
+			"engines": {
+				"node": ">=18"
+			}
+		},
+		"node_modules/playwright/node_modules/fsevents": {
+			"version": "2.3.2",
+			"resolved": "https://registry.npmjs.org/fsevents/-/fsevents-2.3.2.tgz",
+			"integrity": "sha512-xiqMQR4xAeHTuB9uWm+fFRcIOgKBMiOBP+eXiyT7jsgVCq1bkVygt00oASowB7EdtpOHaaPgKt812P9ab+DDKA==",
+			"dev": true,
+			"hasInstallScript": true,
+			"license": "MIT",
+			"optional": true,
+			"os": [
+				"darwin"
+			],
+			"engines": {
+				"node": "^8.16.0 || ^10.6.0 || >=11.0.0"
+			}
+		},
 		"node_modules/pluralize": {
 			"version": "8.0.0",
 			"resolved": "https://registry.npmjs.org/pluralize/-/pluralize-8.0.0.tgz",
@@ -5952,6 +6017,16 @@
 				"node": ">=0.10.0"
 			}
 		},
+		"node_modules/react-resizable-panels": {
+			"version": "4.7.2",
+			"resolved": "https://registry.npmjs.org/react-resizable-panels/-/react-resizable-panels-4.7.2.tgz",
+			"integrity": "sha512-1L2vyeBG96hp7N6x6rzYXJ8EjYiDiffMsqj3cd+T9aOKwscvuyCn2CuZ5q3PoUSTIJUM6Q5DgXH1bdDe6uvh2w==",
+			"license": "MIT",
+			"peerDependencies": {
+				"react": "^18.0.0 || ^19.0.0",
+				"react-dom": "^18.0.0 || ^19.0.0"
+			}
+		},
 		"node_modules/redent": {
 			"version": "3.0.0",
 			"resolved": "https://registry.npmjs.org/redent/-/redent-3.0.0.tgz",
diff --git a/frontend/package.json b/frontend/package.json
index 5594294..c80b68a 100644
--- a/frontend/package.json
+++ b/frontend/package.json
@@ -6,6 +6,8 @@
 	"scripts": {
 		"dev": "vite",
 		"build": "tsc -b && vite build",
+		"e2e": "playwright test",
+		"e2e:install": "playwright install chromium",
 		"lint": "biome check --write",
 		"lint:check": "biome check",
 		"preview": "vite preview",
@@ -31,10 +33,12 @@
 		"react": "^19.1.1",
 		"react-dom": "^19.1.1",
 		"react-markdown": "^9.0.3",
+		"react-resizable-panels": "^4.7.2",
 		"remark-gfm": "^4.0.0"
 	},
 	"devDependencies": {
-		"@biomejs/biome": "2.2.0",
+		"@biomejs/biome": "2.4.6",
+		"@playwright/test": "^1.58.2",
 		"@testing-library/jest-dom": "^6.6.3",
 		"@testing-library/react": "^16.1.0",
 		"@types/node": "^24.3.0",
diff --git a/frontend/playwright.config.ts b/frontend/playwright.config.ts
new file mode 100644
index 0000000..2aaef56
--- /dev/null
+++ b/frontend/playwright.config.ts
@@ -0,0 +1,111 @@
+import path from "node:path";
+import { fileURLToPath } from "node:url";
+import { defineConfig, devices } from "@playwright/test";
+
+const configDir = path.dirname(fileURLToPath(import.meta.url));
+const repoRoot = path.resolve(configDir, "..");
+const backendRoot = path.join(repoRoot, "backend");
+const frontendRoot = configDir;
+const backendPort = process.env.PLAYWRIGHT_BACKEND_PORT ?? "8010";
+const frontendPort = process.env.PLAYWRIGHT_FRONTEND_PORT ?? "4174";
+const backendUrl =
+	process.env.PLAYWRIGHT_BACKEND_URL ?? `http://127.0.0.1:${backendPort}`;
+const frontendUrl =
+	process.env.PLAYWRIGHT_FRONTEND_URL ?? `http://127.0.0.1:${frontendPort}`;
+const cosmosEndpoint =
+	process.env.PLAYWRIGHT_COSMOS_ENDPOINT ?? "http://127.0.0.1:8081";
+const cosmosKey =
+	process.env.PLAYWRIGHT_COSMOS_KEY ??
+	"C2y6yDjf5/R+ob0N8A7Cgv30VRDjEWEhLM+4QDU5DE2nQ9nDuVTqobD4b8mGGyPMbIZnqyMsEcaGQy67XIw/Jw==";
+const devUser = process.env.PLAYWRIGHT_DEV_USER ?? "playwright-e2e@example.com";
+
+const sh = (value: string) => JSON.stringify(value); // Shell-quotes via JSON; only safe for values without $, backticks, or control characters.
+
+export default defineConfig({
+	testDir: "./tests/e2e",
+	fullyParallel: false,
+	workers: 1,
+	timeout: 90_000,
+	expect: {
+		timeout: 15_000,
+	},
+	use: {
+		baseURL: frontendUrl,
+		trace: "retain-on-failure",
+		screenshot: "only-on-failure",
+		video: "retain-on-failure",
+	},
+	projects: [
+		{
+			name: "chromium",
+			use: {
+				...devices["Desktop Chrome"],
+				viewport: { width: 1440, height: 1100 },
+			},
+		},
+	],
+	webServer: [
+		{
+			command: [
+				`cd ${sh(repoRoot)}`,
+				[
+					"python3 -c",
+					sh(
+						"import socket, sys, urllib.parse; " +
+							`u=urllib.parse.urlparse(${JSON.stringify(cosmosEndpoint)}); ` +
+							"host=u.hostname or '127.0.0.1'; " +
+							"port=u.port or 8081; " +
+							"sock=socket.create_connection((host, port), timeout=5); " +
+							"sock.close()",
+					),
+				].join(" "),
+				`cd ${sh(backendRoot)}`,
+				[
+					"env",
+					`GTC_COSMOS_ENDPOINT=${sh(cosmosEndpoint)}`,
+					`GTC_COSMOS_KEY=${sh(cosmosKey)}`,
+					"uv run python scripts/cosmos_container_manager.py",
+					`--endpoint ${sh(cosmosEndpoint)}`,
+					`--key ${sh(cosmosKey)}`,
+					"--no-verify",
+					"--db gt-curator",
+					"--gt-container ground_truth",
+					"--assignments-container assignments",
+					"--tags-container tags",
+					"--tag-definitions-container tag_definitions",
+				].join(" "),
+				[
+					"env",
+					"GTC_ENV_FILE=environments/sample.env",
+					"GTC_AUTH_MODE=dev",
+					"GTC_REPO_BACKEND=cosmos",
+					`GTC_COSMOS_ENDPOINT=${sh(cosmosEndpoint)}`,
+					`GTC_COSMOS_KEY=${sh(cosmosKey)}`,
+					"GTC_COSMOS_DB_NAME=gt-curator",
+					"GTC_USE_COSMOS_EMULATOR=true",
+					"GTC_COSMOS_CONNECTION_VERIFY=false",
+					"GTC_COSMOS_TEST_MODE=false",
+					"GTC_EZAUTH_ENABLED=false",
+					`uv run uvicorn app.main:app --host 127.0.0.1 --port ${backendPort}`,
+				].join(" "),
+			].join(" && "),
+			url: `${backendUrl}/healthz`,
+			timeout: 120_000,
+			reuseExistingServer: false,
+		},
+		{
+			command: [
+				`cd ${sh(frontendRoot)}`,
+				[
+					"env",
+					`HARNESS_BACKEND_URL=${sh(backendUrl)}`,
+					`VITE_DEV_USER_ID=${sh(devUser)}`,
+					`npm run dev -- --host 127.0.0.1 --port ${frontendPort}`,
+				].join(" "),
+			].join(" && "),
+			url: frontendUrl,
+			timeout: 120_000,
+			reuseExistingServer: false,
+		},
+	],
+});
diff --git a/frontend/src/adapters/apiMapper.ts b/frontend/src/adapters/apiMapper.ts
index bf84740..f285c90 100644
--- a/frontend/src/adapters/apiMapper.ts
+++ b/frontend/src/adapters/apiMapper.ts
@@ -1,13 +1,104 @@
 import type { components } from "../api/generated";
-import type { GroundTruthItem, Reference } from "../models/groundTruth";
+import {
+	createConversationTurn,
+	ensureConversationTurnIdentity,
+	type GroundTruthItem,
+	getItemReferences,
+	getLastAgentTurn,
+	getLastUserTurn,
+	type PluginPayload,
+	type Reference,
+	type ToolCallRecord,
+	withDerivedLegacyFields,
+} from "../models/groundTruth";
 import { urlToTitle } from "../models/utils";
 
-export type ApiGroundTruth = components["schemas"]["GroundTruthItem-Output"];
+const _RAG_COMPAT_KEY = "rag-compat";
+const _UNASSOCIATED_KEY = "_unassociated";
+
+type RetrievalBucket = {
+	candidates: Array<{
+		url: string;
+		title?: string;
+		chunk?: string;
+		relevance?: string;
+		toolCallId?: string;
+		messageIndex?: number;
+		turnId?: string;
+		keyParagraph?: string;
+		bonus?: boolean;
+	}>;
+};
+type RetrievalsMap = Record<string, RetrievalBucket>;
+
+type ConversationTurn = NonNullable<GroundTruthItem["history"]>[number];
+export type ApiHistoryEntry = components["schemas"]["HistoryEntry"] & {
+	refs?: components["schemas"]["Reference"][];
+	expectedBehavior?: string[];
+	turnId?: string;
+	stepId?: string;
+};
+export type ApiGroundTruth =
+	components["schemas"]["AgenticGroundTruthEntry-Output"] & {
+		synthQuestion?: string | null;
+		editedQuestion?: string | null;
+		answer?: string | null;
+		refs?: components["schemas"]["Reference"][];
+		totalReferences?: number;
+		tags?: string[];
+		comment?: string | null;
+	} & Omit<
+			components["schemas"]["AgenticGroundTruthEntry-Output"],
+			"history"
+		> & {
+			history?: ApiHistoryEntry[];
+		};
 export type ApiReference = components["schemas"]["Reference"];
 
-export function groundTruthFromApi(api: ApiGroundTruth): GroundTruthItem {
+type StoredTurnIdentity = {
+	turnId?: string;
+	stepId?: string;
+};
+
+function hasOwnField(value: object, field: PropertyKey): boolean {
+	return Object.hasOwn(value, field);
+}
+
+function normalizeToolCalls(
+	toolCalls: components["schemas"]["ToolCallRecord"][] | null | undefined,
+): ToolCallRecord[] | undefined {
+	if (!toolCalls?.length) {
+		return undefined;
+	}
+
+	return toolCalls.map((toolCall) => ({
+		...toolCall,
+		arguments: toolCall.arguments ?? undefined,
+	}));
+}
+
+function getStoredTurnIdentities(
+	plugins: Record<string, PluginPayload>,
+): StoredTurnIdentity[] {
+	const turnIdentity = (
+		plugins[_RAG_COMPAT_KEY]?.data as Record<string, unknown> | undefined
+	)?.turnIdentity;
+	return Array.isArray(turnIdentity)
+		? (turnIdentity as StoredTurnIdentity[])
+		: [];
+}
+
+export function groundTruthFromApi(
+	api: ApiGroundTruth,
+	providerId = "api",
+): GroundTruthItem {
+	const plugins: Record<string, PluginPayload> =
+		api.plugins && Object.keys(api.plugins).length
+			? { ...(api.plugins as Record<string, PluginPayload>) }
+			: {};
+	const storedTurnIdentity = getStoredTurnIdentities(plugins);
 	let history: GroundTruthItem["history"];
-	const refs: Reference[] = [];
+	const legacyRefs: Reference[] = [];
 	let refIndex = 0;
 
 	if (api.history && api.history.length > 0) {
@@ -15,18 +106,23 @@ export function groundTruthFromApi(api: ApiGroundTruth): GroundTruthItem {
 
 		for (let idx = 0; idx < api.history.length; idx++) {
 			const h = api.history[idx];
-			history[idx] = {
-				role: h.role === "assistant" ? "agent" : "user",
+			// Preserve free-form roles; map "assistant" to "agent" for backward compat.
+			const role = h.role === "assistant" ? "agent" : h.role;
+			const identity = storedTurnIdentity[idx];
+			history[idx] = createConversationTurn({
+				role,
 				content: h.msg,
+				turnId: h.turnId || identity?.turnId,
+				stepId: h.stepId || identity?.stepId,
 				expectedBehavior:
 					h.expectedBehavior && h.expectedBehavior.length > 0
-						? h.expectedBehavior
+						? (h.expectedBehavior as ConversationTurn["expectedBehavior"])
 						: undefined,
-			};
+			});
 
 			if (h.refs && h.refs.length > 0) {
 				for (const r of h.refs) {
-					refs.push({
+					legacyRefs.push({
 						id: `ref_${refIndex++}`,
 						title: r.title || (r.url ? urlToTitle(r.url) : undefined),
 						url: r.url,
@@ -35,6 +131,7 @@ export function groundTruthFromApi(api: ApiGroundTruth): GroundTruthItem {
 						visitedAt: null,
 						bonus: r.bonus === true,
 						messageIndex: idx,
+						turnId: history[idx]?.turnId,
 					});
 				}
 			}
@@ -44,8 +141,18 @@ export function groundTruthFromApi(api: ApiGroundTruth): GroundTruthItem {
 		const initialQuestion = api.editedQuestion || api.synthQuestion || "";
 		if (initialQuestion) {
 			history = [
-				{ role: "user" as const, content: initialQuestion },
-				{ role: "agent" as const, content: api.answer || "" },
+				createConversationTurn({
+					role: "user",
+					content: initialQuestion,
+					turnId: storedTurnIdentity[0]?.turnId,
+					stepId: storedTurnIdentity[0]?.stepId,
+				}),
+				createConversationTurn({
+					role: "agent",
+					content: api.answer || "",
+					turnId: storedTurnIdentity[1]?.turnId,
+					stepId: storedTurnIdentity[1]?.stepId,
+				}),
 			];
 		}
 	}
@@ -54,9 +161,13 @@ export function groundTruthFromApi(api: ApiGroundTruth): GroundTruthItem {
 	if (api.refs && api.refs.length > 0) {
 		const wasLegacyConversion = !api.history || api.history.length === 0;
 		const messageIndex = wasLegacyConversion ? 1 : undefined;
+		const turnId =
+			typeof messageIndex === "number"
+				? history?.[messageIndex]?.turnId
+				: undefined;
 
 		for (const r of api.refs) {
-			refs.push({
+			legacyRefs.push({
 				id: `ref_${refIndex++}`,
 				title: r.title || (r.url ? urlToTitle(r.url) : undefined),
 				url: r.url,
@@ -65,21 +176,57 @@ export function groundTruthFromApi(api: ApiGroundTruth): GroundTruthItem {
 				visitedAt: null,
 				bonus: r.bonus === true,
 				messageIndex,
+				turnId,
 			});
 		}
 	}
 
-	const question = api.editedQuestion || api.synthQuestion || "";
+	// Read per-call retrieval state from plugin data if it already exists
+	const existingRetrievals = (
+		plugins[_RAG_COMPAT_KEY]?.data as Record<string, unknown> | undefined
+	)?.retrievals;
+	const hasPerCallState =
+		existingRetrievals &&
+		typeof existingRetrievals === "object" &&
+		!Array.isArray(existingRetrievals) &&
+		Object.keys(existingRetrievals as Record<string, unknown>).length > 0;
+
+	// When no per-call state exists but legacy refs were extracted, migrate them
+	if (!hasPerCallState && legacyRefs.length > 0) {
+		const retrievals: RetrievalsMap = {};
+		for (const ref of legacyRefs) {
+			const key = ref.toolCallId || _UNASSOCIATED_KEY;
+			if (!retrievals[key]) {
+				retrievals[key] = { candidates: [] };
+			}
+			retrievals[key].candidates.push({
+				url: ref.url,
+				title: ref.title,
+				chunk: ref.snippet,
+				relevance: undefined,
+				toolCallId: ref.toolCallId,
+				messageIndex: ref.turnId ? undefined : ref.messageIndex,
+				turnId: ref.turnId,
+				keyParagraph: ref.keyParagraph,
+				bonus: ref.bonus,
+			});
+		}
+
+		const existingPlugin = plugins[_RAG_COMPAT_KEY];
+		plugins[_RAG_COMPAT_KEY] = {
+			kind: _RAG_COMPAT_KEY,
+			version: existingPlugin?.version || "1.0",
+			data: { ...(existingPlugin?.data || {}), retrievals },
+		};
+	}
+
 	const deleted = api.status === "deleted";
 
-	return {
+	return withDerivedLegacyFields({
 		id: api.id,
-		providerId: "api",
-		question,
-		answer: api.answer ?? "",
-		history,
+		providerId,
+		history: history ? ensureConversationTurnIdentity(history) : history,
 		comment: api.comment ?? undefined,
-		references: refs,
 		status:
 			(deleted ? "draft" : (api.status as GroundTruthItem["status"])) ||
 			("draft" as GroundTruthItem["status"]),
@@ -87,30 +234,61 @@ export function groundTruthFromApi(api: ApiGroundTruth): GroundTruthItem {
 		tags: api.tags || [],
 		manualTags: api.manualTags || [],
 		computedTags: api.computedTags || [],
+		reviewedAt: api.reviewedAt ?? null,
 		totalReferences: api.totalReferences,
+		// Generic schema fields — passed through from the API
+		scenarioId: api.scenarioId || undefined,
+		contextEntries:
+			hasOwnField(api, "contextEntries") && Array.isArray(api.contextEntries)
+				? api.contextEntries
+				: undefined,
+		toolCalls: normalizeToolCalls(api.toolCalls),
+		expectedTools: api.expectedTools ?? undefined,
+		feedback: api.feedback?.length ? api.feedback : undefined,
+		metadata:
+			api.metadata && Object.keys(api.metadata).length
+				? (api.metadata as Record<string, unknown>)
+				: undefined,
+		plugins: Object.keys(plugins).length ? plugins : undefined,
+		traceIds: api.traceIds ?? undefined,
+		tracePayload:
+			api.tracePayload && Object.keys(api.tracePayload).length
+				? (api.tracePayload as Record<string, unknown>)
+				: undefined,
 		...({
 			datasetName: api.datasetName,
 			bucket: (api.bucket as string) || "0",
 			_etag: api._etag,
 		} as Record<string, unknown>),
-	};
+	});
 }
 
 export function groundTruthToPatch(args: {
 	item: GroundTruthItem;
 	originalApi?: ApiGroundTruth;
 }): Partial<ApiGroundTruth> {
-	const { item, originalApi } = args;
+	const { originalApi } = args;
+	const item = withDerivedLegacyFields(args.item);
+	const history = ensureConversationTurnIdentity(item.history);
+
+	// Extract references from per-call plugin state
+	const references = getItemReferences(item);
 
 	const hadLegacyTopLevelRefs =
 		!!originalApi &&
-		!originalApi.history &&
+		(!originalApi.history || originalApi.history.length === 0) &&
 		(originalApi.refs?.length || 0) > 0;
 
 	let topLevelRefs: ApiReference[] = [];
 	if (hadLegacyTopLevelRefs) {
-		topLevelRefs = (item.references || [])
-			.filter((r) => r.messageIndex === 1 || r.messageIndex === undefined)
+		const legacyAgentTurnId = history[1]?.turnId;
+		topLevelRefs = references
+			.filter(
+				(r) =>
+					r.turnId === legacyAgentTurnId ||
+					r.messageIndex === 1 ||
+					r.messageIndex === undefined,
+			)
 			.map((r) => ({
 				url: r.url,
 				title: r.title || undefined,
@@ -119,7 +297,7 @@ export function groundTruthToPatch(args: {
 				bonus: !!r.bonus,
 			}));
 	} else {
-		topLevelRefs = (item.references || [])
+		topLevelRefs = references
 			.filter((r) => r.messageIndex === undefined)
 			.map((r) => ({
 				url: r.url,
@@ -134,18 +312,18 @@ export function groundTruthToPatch(args: {
 		status: (item.deleted
 			? "deleted"
 			: item.status) as components["schemas"]["GroundTruthStatus"],
-		answer: item.answer,
-		editedQuestion: item.question,
+		answer: getLastAgentTurn(item),
+		editedQuestion: getLastUserTurn(item),
 		refs: topLevelRefs,
 		manualTags: item.manualTags || [],
 	};
 
-	if (item.history && item.history.length > 0) {
-		body.history = item.history.map((turn, idx) => {
+	if (history.length > 0) {
+		body.history = history.map((turn, idx) => {
 			let turnRefs: ApiReference[] | undefined;
-			if (turn.role === "agent") {
-				const refsForTurn = (item.references || []).filter(
-					(r) => r.messageIndex === idx,
+			if (turn.role !== "user") {
+				const refsForTurn = references.filter(
+					(r) => r.turnId === turn.turnId || r.messageIndex === idx,
 				);
 				if (refsForTurn.length > 0) {
 					turnRefs = refsForTurn.map((r) => ({
@@ -158,9 +336,14 @@ export function groundTruthToPatch(args: {
 				}
 			}
 
+			// Map "agent" back to "assistant" for backward compat; preserve other free-form roles.
+			const apiRole = turn.role === "agent" ? "assistant" : turn.role;
+
 			return {
-				role: turn.role === "agent" ? "assistant" : "user",
+				role: apiRole,
 				msg: turn.content,
+				turnId: turn.turnId,
+				stepId: turn.stepId,
 				expectedBehavior: turn.expectedBehavior || undefined,
 				...(turnRefs ? { refs: turnRefs } : {}),
 			};
@@ -171,5 +354,49 @@ export function groundTruthToPatch(args: {
 		(body as Record<string, unknown>).comment = item.comment ?? null;
 	}
 
+	// Pass through generic fields when present
+	if (
+		hasOwnField(item, "contextEntries") &&
+		Array.isArray(item.contextEntries)
+	) {
+		(body as Record<string, unknown>).contextEntries = item.contextEntries;
+	}
+	if (item.toolCalls?.length) {
+		(body as Record<string, unknown>).toolCalls = item.toolCalls;
+	}
+	if (item.expectedTools) {
+		(body as Record<string, unknown>).expectedTools = item.expectedTools;
+	}
+	if (item.feedback?.length) {
+		(body as Record<string, unknown>).feedback = item.feedback;
+	}
+	if (item.metadata && Object.keys(item.metadata).length) {
+		(body as Record<string, unknown>).metadata = item.metadata;
+	}
+	const plugins = { ...(item.plugins || {}) };
+	const existingCompat = plugins[_RAG_COMPAT_KEY];
+	if (history.length > 0) {
+		plugins[_RAG_COMPAT_KEY] = {
+			kind: _RAG_COMPAT_KEY,
+			version: existingCompat?.version || "1.0",
+			data: {
+				...(existingCompat?.data || {}),
+				turnIdentity: history.map((turn) => ({
+					turnId: turn.turnId,
+					stepId: turn.stepId,
+				})),
+			},
+		};
+	}
+	if (Object.keys(plugins).length) {
+		(body as Record<string, unknown>).plugins = plugins;
+	}
+	if (item.traceIds) {
+		(body as Record<string, unknown>).traceIds = item.traceIds;
+	}
+	if (item.tracePayload && Object.keys(item.tracePayload).length) {
+		(body as Record<string, unknown>).tracePayload = item.tracePayload;
+	}
+
 	return body;
 }
diff --git a/frontend/src/api/client.ts b/frontend/src/api/client.ts
index 75fe131..81f281d 100644
--- a/frontend/src/api/client.ts
+++ b/frontend/src/api/client.ts
@@ -1,4 +1,5 @@
 import createClient from "openapi-fetch";
+import { prefixAppBasePath } from "../services/http";
 import type { paths } from "./generated";
 
 // Typed OpenAPI client configured for our backend
@@ -12,7 +13,19 @@ const defaultHeaders = (() => {
 
 // Wrap fetch to ensure JSON payloads are emitted as UTF-8 with charset declared.
 // We do NOT manually craft \uXXXX sequences; we rely on JSON.stringify and send bytes as-is.
+function withAppBasePath(input: RequestInfo | URL): RequestInfo | URL {
+	if (typeof input === "string") {
+		return prefixAppBasePath(input);
+	}
+	if (typeof Request !== "undefined" && input instanceof Request) {
+		const nextUrl = prefixAppBasePath(input.url);
+		return nextUrl === input.url ? input : new Request(nextUrl, input);
+	}
+	return input;
+}
+
 const utf8JsonFetch: typeof fetch = (input, init) => {
+	const resolvedInput = withAppBasePath(input);
 	if (init && init.body != null) {
 		const hdrs = new Headers(init.headers as HeadersInit | undefined);
 		const contentType = hdrs.get("Content-Type") || hdrs.get("content-type");
@@ -32,7 +45,7 @@ const utf8JsonFetch: typeof fetch = (input, init) => {
 				init.body instanceof ArrayBuffer
 			) {
 				// Already a binary payload; leave it alone
-				return fetch(input, { ...init, headers: hdrs });
+				return fetch(resolvedInput, { ...init, headers: hdrs });
 			} else {
 				try {
 					// Best effort stringify
@@ -41,17 +54,17 @@ const utf8JsonFetch: typeof fetch = (input, init) => {
 					);
 				} catch {
 					// Fall back to default
-					return fetch(input, { ...init, headers: hdrs });
+					return fetch(resolvedInput, { ...init, headers: hdrs });
 				}
 			}
 			// Send as Blob with explicit type to avoid any implicit re-encoding quirks
 			const blob = new Blob([bodyStr], {
 				type: hdrs.get("Content-Type") || "application/json; charset=utf-8",
 			});
-			return fetch(input, { ...init, headers: hdrs, body: blob });
+			return fetch(resolvedInput, { ...init, headers: hdrs, body: blob });
 		}
 	}
-	return fetch(input, init as RequestInit);
+	return fetch(resolvedInput, init as RequestInit);
 };
 
 export const client = createClient<paths>({
diff --git a/frontend/src/api/generated.ts b/frontend/src/api/generated.ts
index 35d6626..e2a83f0 100644
--- a/frontend/src/api/generated.ts
+++ b/frontend/src/api/generated.ts
@@ -507,23 +507,6 @@ export interface paths {
         patch?: never;
         trace?: never;
     };
-    "/v1/chat": {
-        parameters: {
-            query?: never;
-            header?: never;
-            path?: never;
-            cookie?: never;
-        };
-        get?: never;
-        put?: never;
-        /** Chat */
-        post: operations["chat_v1_chat_post"];
-        delete?: never;
-        options?: never;
-        head?: never;
-        patch?: never;
-        trace?: never;
-    };
 }
 export type webhooks = Record<string, never>;
 export interface components {
@@ -533,6 +516,190 @@ export interface components {
             /** Tags */
             tags: string[];
         };
+        /**
+         * AgenticGroundTruthEntry
+         * @description Generic agentic-first host model.
+         *
+         *     The core contract intentionally exposes only the generic schema in OpenAPI. Legacy
+         *     RAG-shaped payloads are translated into this shape when validating this base class so
+         *     existing data can be carried forward without remaining top-level contract fields.
+         */
+        "AgenticGroundTruthEntry-Input": {
+            /** Id */
+            id: string;
+            /** Datasetname */
+            datasetName: string;
+            /** Bucket */
+            bucket?: string | null;
+            /** @default draft */
+            status: components["schemas"]["GroundTruthStatus"];
+            /**
+             * Doctype
+             * @default ground-truth-item
+             */
+            docType: string;
+            /**
+             * Schemaversion
+             * @default v2
+             */
+            schemaVersion: string;
+            /** Manualtags */
+            manualTags?: string[];
+            /** Computedtags */
+            computedTags?: string[];
+            /**
+             * Comment
+             * @default
+             */
+            comment: string;
+            /** Assignedto */
+            assignedTo?: string | null;
+            /** Assignedat */
+            assignedAt?: string | null;
+            /**
+             * Updatedat
+             * Format: date-time
+             */
+            updatedAt?: string;
+            /** Updatedby */
+            updatedBy?: string | null;
+            /** Reviewedat */
+            reviewedAt?: string | null;
+            /** Etag */
+            _etag?: string | null;
+            /**
+             * Scenarioid
+             * @default
+             */
+            scenarioId: string;
+            /** History */
+            history?: components["schemas"]["HistoryEntry"][];
+            /** Contextentries */
+            contextEntries?: components["schemas"]["ContextEntry"][];
+            /** Traceids */
+            traceIds?: {
+                [key: string]: string;
+            } | null;
+            /** Toolcalls */
+            toolCalls?: components["schemas"]["ToolCallRecord"][];
+            expectedTools?: components["schemas"]["ExpectedTools"];
+            /** Feedback */
+            feedback?: components["schemas"]["FeedbackEntry"][];
+            /** Metadata */
+            metadata?: {
+                [key: string]: unknown;
+            };
+            /** Plugins */
+            plugins?: {
+                [key: string]: components["schemas"]["PluginPayload"];
+            };
+            /** Createdby */
+            createdBy?: string | null;
+            /** Createdat */
+            createdAt?: string | null;
+            /** Tracepayload */
+            tracePayload?: {
+                [key: string]: unknown;
+            };
+        };
+        /**
+         * AgenticGroundTruthEntry
+         * @description Generic agentic-first host model.
+         *
+         *     The core contract intentionally exposes only the generic schema in OpenAPI. Legacy
+         *     RAG-shaped payloads are translated into this shape when validating this base class so
+         *     existing data can be carried forward without remaining top-level contract fields.
+         */
+        "AgenticGroundTruthEntry-Output": {
+            /** Id */
+            id: string;
+            /** Datasetname */
+            datasetName: string;
+            /** Bucket */
+            bucket?: string | null;
+            /** @default draft */
+            status: components["schemas"]["GroundTruthStatus"];
+            /**
+             * Doctype
+             * @default ground-truth-item
+             */
+            docType: string;
+            /**
+             * Schemaversion
+             * @default v2
+             */
+            schemaVersion: string;
+            /** Manualtags */
+            manualTags?: string[];
+            /** Computedtags */
+            computedTags?: string[];
+            /**
+             * Comment
+             * @default
+             */
+            comment: string;
+            /** Assignedto */
+            assignedTo?: string | null;
+            /** Assignedat */
+            assignedAt?: string | null;
+            /**
+             * Updatedat
+             * Format: date-time
+             */
+            updatedAt?: string;
+            /** Updatedby */
+            updatedBy?: string | null;
+            /** Reviewedat */
+            reviewedAt?: string | null;
+            /** Etag */
+            _etag?: string | null;
+            /**
+             * Scenarioid
+             * @default
+             */
+            scenarioId: string;
+            /** History */
+            history?: components["schemas"]["HistoryEntry"][];
+            /** Contextentries */
+            contextEntries?: components["schemas"]["ContextEntry"][];
+            /** Traceids */
+            traceIds?: {
+                [key: string]: string;
+            } | null;
+            /** Toolcalls */
+            toolCalls?: components["schemas"]["ToolCallRecord"][];
+            expectedTools?: components["schemas"]["ExpectedTools"];
+            /** Feedback */
+            feedback?: components["schemas"]["FeedbackEntry"][];
+            /** Metadata */
+            metadata?: {
+                [key: string]: unknown;
+            };
+            /** Plugins */
+            plugins?: {
+                [key: string]: components["schemas"]["PluginPayload"];
+            };
+            /** Createdby */
+            createdBy?: string | null;
+            /** Createdat */
+            createdAt?: string | null;
+            /** Tracepayload */
+            tracePayload?: {
+                [key: string]: unknown;
+            };
+            /** Tags */
+            readonly tags: string[];
+            /** Synthquestion */
+            readonly synthQuestion: string | null;
+            /** Editedquestion */
+            readonly editedQuestion: string | null;
+            /** Answer */
+            readonly answer: string | null;
+            /** Refs */
+            readonly refs: components["schemas"]["Reference"][];
+            /** Totalreferences */
+            readonly totalReferences: number;
+        };
         /**
          * AssignItemRequest
          * @description Request body for assignment endpoint.
@@ -545,24 +712,12 @@ export interface components {
              */
             force: boolean;
         };
-        /**
-         * AssignmentUpdateRequest
-         * @description Payload for SME update (save draft / approve / skip / delete).
-         *
-         *     Using a Pydantic model allows camelCase -> snake_case alias handling. All fields optional; we
-         *     only mutate those explicitly provided (tracked via model_fields_set).
-         */
+        /** AssignmentUpdateRequest */
         AssignmentUpdateRequest: {
-            /** Editedquestion */
-            editedQuestion?: string | null;
-            /** Answer */
-            answer?: string | null;
             /** Comment */
             comment?: string | null;
             /** Status */
-            status?: components["schemas"]["GroundTruthStatus"] | string | null;
-            /** Refs */
-            refs?: components["schemas"]["Reference"][] | null;
+            status?: components["schemas"]["GroundTruthStatus"] | string;
             /** Manualtags */
             manualTags?: string[] | null;
             /** Approve */
@@ -570,16 +725,37 @@ export interface components {
             /** Etag */
             etag?: string | null;
             /** History */
-            history?: {
+            history?: components["schemas"]["HistoryEntryPatch"][] | null;
+            /** Contextentries */
+            contextEntries?: components["schemas"]["ContextEntry"][] | null;
+            /** Toolcalls */
+            toolCalls?: components["schemas"]["ToolCallRecord"][] | null;
+            /** Expectedtools */
+            expectedTools?: components["schemas"]["ExpectedTools"];
+            /** Feedback */
+            feedback?: components["schemas"]["FeedbackEntry"][] | null;
+            /** Metadata */
+            metadata?: {
+                [key: string]: unknown;
+            } | null;
+            /** Plugins */
+            plugins?: {
+                [key: string]: components["schemas"]["PluginPayload"];
+            } | null;
+            /** Traceids */
+            traceIds?: {
+                [key: string]: string;
+            } | null;
+            /** Tracepayload */
+            tracePayload?: {
                 [key: string]: unknown;
-            }[] | null;
+            } | null;
+            /** Scenarioid */
+            scenarioId?: string | null;
         } & {
             [key: string]: unknown;
         };
-        /**
-         * BulkImportError
-         * @description Structured error for bulk import failures.
-         */
+        /** BulkImportError */
         BulkImportError: {
             /**
              * Index
@@ -607,34 +783,12 @@ export interface components {
              */
             message: string;
         };
-        /** ChatReference */
-        ChatReference: {
-            /** Id */
-            id?: string | null;
-            /** Title */
-            title?: string | null;
-            /** Url */
-            url?: string | null;
-            /** Snippet */
-            snippet?: string | null;
-            /** Keyparagraph */
-            keyParagraph?: string | null;
-        } & {
-            [key: string]: unknown;
-        };
-        /** ChatRequest */
-        ChatRequest: {
-            /** Message */
-            message: string;
-            /** Context */
-            context?: string | null;
-        };
-        /** ChatResponse */
-        ChatResponse: {
-            /** Content */
-            content: string;
-            /** References */
-            references?: components["schemas"]["ChatReference"][];
+        /** ContextEntry */
+        ContextEntry: {
+            /** Key */
+            key: string;
+            /** Value */
+            value: unknown;
         };
         /** CurationInstructionsUpdate */
         CurationInstructionsUpdate: {
@@ -643,14 +797,7 @@ export interface components {
             /** Etag */
             _etag?: string | null;
         };
-        /**
-         * DatasetCurationInstructions
-         * @description Dataset-level curation instructions document (schemaVersion v1).
-         *
-         *     Stored in the same Cosmos container as ground-truth items using MultiHash PK
-         *     [/datasetName, /bucket] with bucket fixed to 0 and a stable id pattern
-         *     "curation-instructions|{datasetName}".
-         */
+        /** DatasetCurationInstructions */
         DatasetCurationInstructions: {
             /** Id */
             id: string;
@@ -722,18 +869,17 @@ export interface components {
             matchReason: string;
         };
         /**
-         * ExpectedBehavior
-         * @description Expected behavior tags for history items in ground truth evaluation.
-         *
-         *     These tags describe what the agent should do at each turn of a conversation:
-         *     - tool:search: Agent should perform a search/retrieval operation
-         *     - generation:answer: Agent should generate a direct answer
-         *     - generation:need-context: Agent should ask for more context
-         *     - generation:clarification: Agent should ask for clarification
-         *     - generation:out-of-domain: Agent should indicate the query is out of domain
-         * @enum {string}
+         * ExpectedTools
+         * @description Tool expectations. Tools are implicitly allowed unless listed here.
          */
-        ExpectedBehavior: "tool:search" | "generation:answer" | "generation:need-context" | "generation:clarification" | "generation:out-of-domain";
+        ExpectedTools: {
+            /** Required */
+            required?: components["schemas"]["ToolExpectation"][];
+            /** Optional */
+            optional?: components["schemas"]["ToolExpectation"][];
+            /** Notneeded */
+            notNeeded?: components["schemas"]["ToolExpectation"][];
+        };
         /** ExportDeliveryOptions */
         ExportDeliveryOptions: {
             /**
@@ -753,6 +899,18 @@ export interface components {
              */
             status: string;
         };
+        /** FeedbackEntry */
+        FeedbackEntry: {
+            /**
+             * Source
+             * @default
+             */
+            source: string;
+            /** Values */
+            values?: {
+                [key: string]: unknown;
+            };
+        };
         /**
          * FrontendConfig
          * @description Frontend runtime configuration.
@@ -777,169 +935,10 @@ export interface components {
             /** Groups */
             groups: components["schemas"]["TagGroupGlossaryDTO"][];
         };
-        /**
-         * GroundTruthItem
-         * @description Canonical Ground Truth item aligned to wire schema (schemaVersion v1).
-         *
-         *     All fields with camelCase wire names use aliases; we accept both field names and aliases
-         *     on input (populate_by_name=True) and always serialize using by_alias.
-         */
-        "GroundTruthItem-Input": {
-            /** Id */
-            id: string;
-            /** Datasetname */
-            datasetName: string;
-            /** Bucket */
-            bucket?: string | null;
-            /** @default draft */
-            status: components["schemas"]["GroundTruthStatus"];
-            /**
-             * Doctype
-             * @default ground-truth-item
-             */
-            docType: string;
-            /**
-             * Schemaversion
-             * @default v2
-             */
-            schemaVersion: string;
-            /** Synthquestion */
-            synthQuestion: string;
-            /** Editedquestion */
-            editedQuestion?: string | null;
-            /** Answer */
-            answer?: string | null;
-            /** Refs */
-            refs?: components["schemas"]["Reference"][];
-            /** Manualtags */
-            manualTags?: string[];
-            /** Computedtags */
-            computedTags?: string[];
-            /** Comment */
-            comment?: string | null;
-            /** History */
-            history?: components["schemas"]["HistoryItem"][] | null;
-            /** Contextusedforgeneration */
-            contextUsedForGeneration?: string | null;
-            /** Contextsource */
-            contextSource?: string | null;
-            /** Modelusedforgeneration */
-            modelUsedForGeneration?: string | null;
-            /** Semanticclusternumber */
-            semanticClusterNumber?: number | null;
-            /** Weight */
-            weight?: number | null;
-            /** Samplingbucket */
-            samplingBucket?: number | null;
-            /** Questionlength */
-            questionLength?: number | null;
-            /** Assignedto */
-            assignedTo?: string | null;
-            /** Assignedat */
-            assignedAt?: string | null;
-            /**
-             * Updatedat
-             * Format: date-time
-             */
-            updatedAt?: string;
-            /** Updatedby */
-            updatedBy?: string | null;
-            /** Reviewedat */
-            reviewedAt?: string | null;
-            /** Etag */
-            _etag?: string | null;
-            /**
-             * Totalreferences
-             * @default 0
-             */
-            totalReferences: number;
-        };
-        /**
-         * GroundTruthItem
-         * @description Canonical Ground Truth item aligned to wire schema (schemaVersion v1).
-         *
-         *     All fields with camelCase wire names use aliases; we accept both field names and aliases
-         *     on input (populate_by_name=True) and always serialize using by_alias.
-         */
-        "GroundTruthItem-Output": {
-            /** Id */
-            id: string;
-            /** Datasetname */
-            datasetName: string;
-            /** Bucket */
-            bucket?: string | null;
-            /** @default draft */
-            status: components["schemas"]["GroundTruthStatus"];
-            /**
-             * Doctype
-             * @default ground-truth-item
-             */
-            docType: string;
-            /**
-             * Schemaversion
-             * @default v2
-             */
-            schemaVersion: string;
-            /** Synthquestion */
-            synthQuestion: string;
-            /** Editedquestion */
-            editedQuestion?: string | null;
-            /** Answer */
-            answer?: string | null;
-            /** Refs */
-            refs?: components["schemas"]["Reference"][];
-            /** Manualtags */
-            manualTags?: string[];
-            /** Computedtags */
-            computedTags?: string[];
-            /** Comment */
-            comment?: string | null;
-            /** History */
-            history?: components["schemas"]["HistoryItem"][] | null;
-            /** Contextusedforgeneration */
-            contextUsedForGeneration?: string | null;
-            /** Contextsource */
-            contextSource?: string | null;
-            /** Modelusedforgeneration */
-            modelUsedForGeneration?: string | null;
-            /** Semanticclusternumber */
-            semanticClusterNumber?: number | null;
-            /** Weight */
-            weight?: number | null;
-            /** Samplingbucket */
-            samplingBucket?: number | null;
-            /** Questionlength */
-            questionLength?: number | null;
-            /** Assignedto */
-            assignedTo?: string | null;
-            /** Assignedat */
-            assignedAt?: string | null;
-            /**
-             * Updatedat
-             * Format: date-time
-             */
-            updatedAt?: string;
-            /** Updatedby */
-            updatedBy?: string | null;
-            /** Reviewedat */
-            reviewedAt?: string | null;
-            /** Etag */
-            _etag?: string | null;
-            /**
-             * Totalreferences
-             * @default 0
-             */
-            totalReferences: number;
-            /**
-             * Tags
-             * @description Return a merged, sorted view of manual and computed tags.
-             */
-            readonly tags: string[];
-        };
         /** GroundTruthListResponse */
         GroundTruthListResponse: {
             /** Items */
-            items: components["schemas"]["GroundTruthItem-Output"][];
+            items: components["schemas"]["AgenticGroundTruthEntry-Output"][];
             pagination: components["schemas"]["PaginationMetadata"];
         };
         /**
@@ -947,32 +946,68 @@ export interface components {
          * @enum {string}
          */
         GroundTruthStatus: "draft" | "approved" | "deleted" | "skipped";
+        /** GroundTruthUpdateRequest */
+        GroundTruthUpdateRequest: {
+            /** Status */
+            status?: components["schemas"]["GroundTruthStatus"] | string;
+            /** Comment */
+            comment?: string | null;
+            /** History */
+            history?: components["schemas"]["HistoryEntryPatch"][] | null;
+            /** Contextentries */
+            contextEntries?: components["schemas"]["ContextEntry"][] | null;
+            /** Toolcalls */
+            toolCalls?: components["schemas"]["ToolCallRecord"][] | null;
+            /** Expectedtools */
+            expectedTools?: components["schemas"]["ExpectedTools"];
+            /** Feedback */
+            feedback?: components["schemas"]["FeedbackEntry"][] | null;
+            /** Metadata */
+            metadata?: {
+                [key: string]: unknown;
+            } | null;
+            /** Plugins */
+            plugins?: {
+                [key: string]: components["schemas"]["PluginPayload"];
+            } | null;
+            /** Manualtags */
+            manualTags?: string[] | null;
+            /** Traceids */
+            traceIds?: {
+                [key: string]: string;
+            } | null;
+            /** Tracepayload */
+            tracePayload?: {
+                [key: string]: unknown;
+            } | null;
+            /** Scenarioid */
+            scenarioId?: string | null;
+            /** Etag */
+            etag?: string | null;
+        } & {
+            [key: string]: unknown;
+        };
         /** HTTPValidationError */
         HTTPValidationError: {
             /** Detail */
             detail?: components["schemas"]["ValidationError"][];
         };
-        /**
-         * HistoryItem
-         * @description Represents a single item in the multi-turn history.
-         */
-        HistoryItem: {
-            role: components["schemas"]["HistoryItemRole"];
+        /** HistoryEntry */
+        HistoryEntry: {
+            /** Role */
+            role: string;
             /** Msg */
             msg: string;
-            /** Refs */
-            refs?: components["schemas"]["Reference"][] | null;
-            /**
-             * Expectedbehavior
-             * @description Expected behavior(s) for this turn in the conversation (e.g., tool:search, generation:answer)
-             */
-            expectedBehavior?: components["schemas"]["ExpectedBehavior"][] | null;
         };
-        /**
-         * HistoryItemRole
-         * @enum {string}
-         */
-        HistoryItemRole: "user" | "assistant";
+        /** HistoryEntryPatch */
+        HistoryEntryPatch: {
+            /** Role */
+            role: string;
+            /** Msg */
+            msg?: string | null;
+        } & {
+            [key: string]: unknown;
+        };
         /** ImportBulkResponse */
         ImportBulkResponse: {
             /**
@@ -1040,10 +1075,7 @@ export interface components {
              */
             position: number;
         };
-        /**
-         * PaginationMetadata
-         * @description Pagination metadata for list responses.
-         */
+        /** PaginationMetadata */
         PaginationMetadata: {
             /**
              * Page
@@ -1076,6 +1108,20 @@ export interface components {
              */
             hasPrev: boolean;
         };
+        /** PluginPayload */
+        PluginPayload: {
+            /** Kind */
+            kind: string;
+            /**
+             * Version
+             * @default 1.0
+             */
+            version: string;
+            /** Data */
+            data?: {
+                [key: string]: unknown;
+            };
+        };
         /**
          * RecomputeTagsResponse
          * @description Response for bulk computed tag recomputation.
@@ -1119,9 +1165,7 @@ export interface components {
         };
         /**
          * Reference
-         * @description Wire reference object.
-         *
-         *     { url, title, content, keyExcerpt, type, bonus, messageIndex }
+         * @description Legacy RAG reference object retained for compatibility helpers and tests.
          */
         Reference: {
             /**
@@ -1156,7 +1200,7 @@ export interface components {
         /** SelfServeResponse */
         SelfServeResponse: {
             /** Assigned */
-            assigned: components["schemas"]["GroundTruthItem-Output"][];
+            assigned: components["schemas"]["AgenticGroundTruthEntry-Output"][];
             /** Requested */
             requested: number;
             /** Assignedcount */
@@ -1259,6 +1303,45 @@ export interface components {
             /** Groups */
             groups: components["schemas"]["TagGroupDTO"][];
         };
+        /** ToolCallRecord */
+        ToolCallRecord: {
+            /**
+             * Id
+             * @default
+             */
+            id: string;
+            /** Name */
+            name: string;
+            /**
+             * Calltype
+             * @default tool
+             * @enum {string}
+             */
+            callType: "tool" | "subagent";
+            /** Arguments */
+            arguments?: {
+                [key: string]: unknown;
+            } | null;
+            /** Agent */
+            agent?: string | null;
+            /** Stepnumber */
+            stepNumber?: number | null;
+            /** Parallelgroup */
+            parallelGroup?: string | null;
+            /** Parentcallid */
+            parentCallId?: string | null;
+            /** Response */
+            response?: unknown;
+        };
+        /** ToolExpectation */
+        ToolExpectation: {
+            /** Name */
+            name: string;
+            /** Arguments */
+            arguments?: {
+                [key: string]: unknown;
+            } | string | null;
+        };
         /** ValidationError */
         ValidationError: {
             /** Location */
@@ -1268,10 +1351,7 @@ export interface components {
             /** Error Type */
             type: string;
         };
-        /**
-         * ValidationSummary
-         * @description Summary statistics for bulk import.
-         */
+        /** ValidationSummary */
         ValidationSummary: {
             /**
              * Total
@@ -1418,7 +1498,7 @@ export interface operations {
         };
         requestBody: {
             content: {
-                "application/json": components["schemas"]["GroundTruthItem-Input"][];
+                "application/json": components["schemas"]["AgenticGroundTruthEntry-Input"][];
             };
         };
         responses: {
@@ -1514,7 +1594,7 @@ export interface operations {
                     [name: string]: unknown;
                 };
                 content: {
-                    "application/json": components["schemas"]["GroundTruthItem-Output"][];
+                    "application/json": components["schemas"]["AgenticGroundTruthEntry-Output"][];
                 };
             };
             /** @description Validation Error */
@@ -1547,7 +1627,7 @@ export interface operations {
                     [name: string]: unknown;
                 };
                 content: {
-                    "application/json": components["schemas"]["GroundTruthItem-Output"];
+                    "application/json": components["schemas"]["AgenticGroundTruthEntry-Output"];
                 };
             };
             /** @description Validation Error */
@@ -1576,9 +1656,7 @@ export interface operations {
         };
         requestBody: {
             content: {
-                "application/json": {
-                    [key: string]: unknown;
-                };
+                "application/json": components["schemas"]["GroundTruthUpdateRequest"];
             };
         };
         responses: {
@@ -1588,7 +1666,7 @@ export interface operations {
                     [name: string]: unknown;
                 };
                 content: {
-                    "application/json": components["schemas"]["GroundTruthItem-Output"];
+                    "application/json": components["schemas"]["AgenticGroundTruthEntry-Output"];
                 };
             };
             /** @description Validation Error */
@@ -1723,7 +1801,7 @@ export interface operations {
                     [name: string]: unknown;
                 };
                 content: {
-                    "application/json": components["schemas"]["GroundTruthItem-Output"][];
+                    "application/json": components["schemas"]["AgenticGroundTruthEntry-Output"][];
                 };
             };
         };
@@ -1753,7 +1831,7 @@ export interface operations {
                     [name: string]: unknown;
                 };
                 content: {
-                    "application/json": components["schemas"]["GroundTruthItem-Output"];
+                    "application/json": components["schemas"]["AgenticGroundTruthEntry-Output"];
                 };
             };
             /** @description Validation Error */
@@ -1823,7 +1901,7 @@ export interface operations {
                     [name: string]: unknown;
                 };
                 content: {
-                    "application/json": components["schemas"]["GroundTruthItem-Output"];
+                    "application/json": components["schemas"]["AgenticGroundTruthEntry-Output"];
                 };
             };
             /** @description Validation Error */
@@ -2237,37 +2315,4 @@ export interface operations {
             };
         };
     };
-    chat_v1_chat_post: {
-        parameters: {
-            query?: never;
-            header?: never;
-            path?: never;
-            cookie?: never;
-        };
-        requestBody: {
-            content: {
-                "application/json": components["schemas"]["ChatRequest"];
-            };
-        };
-        responses: {
-            /** @description Successful Response */
-            200: {
-                headers: {
-                    [name: string]: unknown;
-                };
-                content: {
-                    "application/json": components["schemas"]["ChatResponse"];
-                };
-            };
-            /** @description Validation Error */
-            422: {
-                headers: {
-                    [name: string]: unknown;
-                };
-                content: {
-                    "application/json": components["schemas"]["HTTPValidationError"];
-                };
-            };
-        };
-    };
 }
diff --git a/frontend/src/api/openapi.json b/frontend/src/api/openapi.json
index 69dbb8a..6135025 100644
--- a/frontend/src/api/openapi.json
+++ b/frontend/src/api/openapi.json
@@ -110,7 +110,7 @@
 							"schema": {
 								"type": "array",
 								"items": {
-									"$ref": "#/components/schemas/GroundTruthItem-Input"
+									"$ref": "#/components/schemas/AgenticGroundTruthEntry-Input"
 								},
 								"title": "Items"
 							}
@@ -422,7 +422,7 @@
 								"schema": {
 									"type": "array",
 									"items": {
-										"$ref": "#/components/schemas/GroundTruthItem-Output"
+										"$ref": "#/components/schemas/AgenticGroundTruthEntry-Output"
 									},
 									"title": "Response List Ground Truths V1 Ground Truths  Datasetname  Get"
 								}
@@ -483,7 +483,7 @@
 						"content": {
 							"application/json": {
 								"schema": {
-									"$ref": "#/components/schemas/GroundTruthItem-Output"
+									"$ref": "#/components/schemas/AgenticGroundTruthEntry-Output"
 								}
 							}
 						}
@@ -555,9 +555,7 @@
 					"content": {
 						"application/json": {
 							"schema": {
-								"type": "object",
-								"additionalProperties": true,
-								"title": "Payload"
+								"$ref": "#/components/schemas/GroundTruthUpdateRequest"
 							}
 						}
 					}
@@ -568,7 +566,7 @@
 						"content": {
 							"application/json": {
 								"schema": {
-									"$ref": "#/components/schemas/GroundTruthItem-Output"
+									"$ref": "#/components/schemas/AgenticGroundTruthEntry-Output"
 								}
 							}
 						}
@@ -781,7 +779,7 @@
 							"application/json": {
 								"schema": {
 									"items": {
-										"$ref": "#/components/schemas/GroundTruthItem-Output"
+										"$ref": "#/components/schemas/AgenticGroundTruthEntry-Output"
 									},
 									"type": "array",
 									"title": "Response List My Assignments V1 Assignments My Get"
@@ -859,7 +857,7 @@
 						"content": {
 							"application/json": {
 								"schema": {
-									"$ref": "#/components/schemas/GroundTruthItem-Output"
+									"$ref": "#/components/schemas/AgenticGroundTruthEntry-Output"
 								}
 							}
 						}
@@ -994,7 +992,7 @@
 						"content": {
 							"application/json": {
 								"schema": {
-									"$ref": "#/components/schemas/GroundTruthItem-Output"
+									"$ref": "#/components/schemas/AgenticGroundTruthEntry-Output"
 								}
 							}
 						}
@@ -1539,45 +1537,6 @@
 					}
 				}
 			}
-		},
-		"/v1/chat": {
-			"post": {
-				"tags": ["chat"],
-				"summary": "Chat",
-				"operationId": "chat_v1_chat_post",
-				"requestBody": {
-					"content": {
-						"application/json": {
-							"schema": {
-								"$ref": "#/components/schemas/ChatRequest"
-							}
-						}
-					},
-					"required": true
-				},
-				"responses": {
-					"200": {
-						"description": "Successful Response",
-						"content": {
-							"application/json": {
-								"schema": {
-									"$ref": "#/components/schemas/ChatResponse"
-								}
-							}
-						}
-					},
-					"422": {
-						"description": "Validation Error",
-						"content": {
-							"application/json": {
-								"schema": {
-									"$ref": "#/components/schemas/HTTPValidationError"
-								}
-							}
-						}
-					}
-				}
-			}
 		}
 	},
 	"components": {
@@ -1596,33 +1555,62 @@
 				"required": ["tags"],
 				"title": "AddTagsRequest"
 			},
-			"AssignItemRequest": {
-				"properties": {
-					"force": {
-						"type": "boolean",
-						"title": "Force",
-						"description": "Force assignment even if item is assigned to another user (requires admin or team-lead role)",
-						"default": false
-					}
-				},
-				"type": "object",
-				"title": "AssignItemRequest",
-				"description": "Request body for assignment endpoint."
-			},
-			"AssignmentUpdateRequest": {
+			"AgenticGroundTruthEntry-Input": {
 				"properties": {
-					"editedQuestion": {
+					"id": {
+						"type": "string",
+						"title": "Id"
+					},
+					"datasetName": {
+						"type": "string",
+						"title": "Datasetname"
+					},
+					"bucket": {
 						"anyOf": [
 							{
-								"type": "string"
+								"type": "string",
+								"format": "uuid"
 							},
 							{
 								"type": "null"
 							}
 						],
-						"title": "Editedquestion"
+						"title": "Bucket"
 					},
-					"answer": {
+					"status": {
+						"$ref": "#/components/schemas/GroundTruthStatus",
+						"default": "draft"
+					},
+					"docType": {
+						"type": "string",
+						"title": "Doctype",
+						"default": "ground-truth-item"
+					},
+					"schemaVersion": {
+						"type": "string",
+						"title": "Schemaversion",
+						"default": "v2"
+					},
+					"manualTags": {
+						"items": {
+							"type": "string"
+						},
+						"type": "array",
+						"title": "Manualtags"
+					},
+					"computedTags": {
+						"items": {
+							"type": "string"
+						},
+						"type": "array",
+						"title": "Computedtags"
+					},
+					"comment": {
+						"type": "string",
+						"title": "Comment",
+						"default": ""
+					},
+					"assignedTo": {
 						"anyOf": [
 							{
 								"type": "string"
@@ -1631,24 +1619,27 @@
 								"type": "null"
 							}
 						],
-						"title": "Answer"
+						"title": "Assignedto"
 					},
-					"comment": {
+					"assignedAt": {
 						"anyOf": [
 							{
-								"type": "string"
+								"type": "string",
+								"format": "date-time"
 							},
 							{
 								"type": "null"
 							}
 						],
-						"title": "Comment"
+						"title": "Assignedat"
 					},
-					"status": {
+					"updatedAt": {
+						"type": "string",
+						"format": "date-time",
+						"title": "Updatedat"
+					},
+					"updatedBy": {
 						"anyOf": [
-							{
-								"$ref": "#/components/schemas/GroundTruthStatus"
-							},
 							{
 								"type": "string"
 							},
@@ -1656,48 +1647,94 @@
 								"type": "null"
 							}
 						],
-						"title": "Status"
+						"title": "Updatedby"
 					},
-					"refs": {
+					"reviewedAt": {
 						"anyOf": [
 							{
-								"items": {
-									"$ref": "#/components/schemas/Reference"
-								},
-								"type": "array"
+								"type": "string",
+								"format": "date-time"
 							},
 							{
 								"type": "null"
 							}
 						],
-						"title": "Refs"
+						"title": "Reviewedat"
 					},
-					"manualTags": {
+					"_etag": {
 						"anyOf": [
 							{
-								"items": {
-									"type": "string"
-								},
-								"type": "array"
+								"type": "string"
 							},
 							{
 								"type": "null"
 							}
 						],
-						"title": "Manualtags"
+						"title": "Etag"
 					},
-					"approve": {
+					"scenarioId": {
+						"type": "string",
+						"title": "Scenarioid",
+						"default": ""
+					},
+					"history": {
+						"items": {
+							"$ref": "#/components/schemas/HistoryEntry"
+						},
+						"type": "array",
+						"title": "History"
+					},
+					"contextEntries": {
+						"items": {
+							"$ref": "#/components/schemas/ContextEntry"
+						},
+						"type": "array",
+						"title": "Contextentries"
+					},
+					"traceIds": {
 						"anyOf": [
 							{
-								"type": "boolean"
+								"additionalProperties": {
+									"type": "string"
+								},
+								"type": "object"
 							},
 							{
 								"type": "null"
 							}
 						],
-						"title": "Approve"
+						"title": "Traceids"
 					},
-					"etag": {
+					"toolCalls": {
+						"items": {
+							"$ref": "#/components/schemas/ToolCallRecord"
+						},
+						"type": "array",
+						"title": "Toolcalls"
+					},
+					"expectedTools": {
+						"$ref": "#/components/schemas/ExpectedTools"
+					},
+					"feedback": {
+						"items": {
+							"$ref": "#/components/schemas/FeedbackEntry"
+						},
+						"type": "array",
+						"title": "Feedback"
+					},
+					"metadata": {
+						"additionalProperties": true,
+						"type": "object",
+						"title": "Metadata"
+					},
+					"plugins": {
+						"additionalProperties": {
+							"$ref": "#/components/schemas/PluginPayload"
+						},
+						"type": "object",
+						"title": "Plugins"
+					},
+					"createdBy": {
 						"anyOf": [
 							{
 								"type": "string"
@@ -1706,79 +1743,88 @@
 								"type": "null"
 							}
 						],
-						"title": "Etag"
+						"title": "Createdby"
 					},
-					"history": {
+					"createdAt": {
 						"anyOf": [
 							{
-								"items": {
-									"additionalProperties": true,
-									"type": "object"
-								},
-								"type": "array"
+								"type": "string",
+								"format": "date-time"
 							},
 							{
 								"type": "null"
 							}
 						],
-						"title": "History"
+						"title": "Createdat"
+					},
+					"tracePayload": {
+						"additionalProperties": true,
+						"type": "object",
+						"title": "Tracepayload"
 					}
 				},
-				"additionalProperties": true,
+				"additionalProperties": false,
 				"type": "object",
-				"title": "AssignmentUpdateRequest",
-				"description": "Payload for SME update (save draft / approve / skip / delete).\n\nUsing a Pydantic model allows camelCase -> snake_case alias handling. All fields optional; we\nonly mutate those explicitly provided (tracked via model_fields_set)."
+				"required": ["id", "datasetName"],
+				"title": "AgenticGroundTruthEntry",
+				"description": "Generic agentic-first host model.\n\nThe core contract intentionally exposes only the generic schema in OpenAPI. Legacy\nRAG-shaped payloads are translated into this shape when validating this base class so\nexisting data can be carried forward without remaining top-level contract fields."
 			},
-			"BulkImportError": {
+			"AgenticGroundTruthEntry-Output": {
 				"properties": {
-					"index": {
-						"type": "integer",
-						"title": "Index",
-						"description": "0-based position in request array"
+					"id": {
+						"type": "string",
+						"title": "Id"
 					},
-					"itemId": {
-						"anyOf": [
-							{
-								"type": "string"
-							},
-							{
-								"type": "null"
-							}
-						],
-						"title": "Itemid",
-						"description": "ID of the failed item (if available)"
+					"datasetName": {
+						"type": "string",
+						"title": "Datasetname"
 					},
-					"field": {
+					"bucket": {
 						"anyOf": [
 							{
-								"type": "string"
+								"type": "string",
+								"format": "uuid"
 							},
 							{
 								"type": "null"
 							}
 						],
-						"title": "Field",
-						"description": "Field that caused the error (if applicable)"
+						"title": "Bucket"
 					},
-					"code": {
+					"status": {
+						"$ref": "#/components/schemas/GroundTruthStatus",
+						"default": "draft"
+					},
+					"docType": {
 						"type": "string",
-						"title": "Code",
-						"description": "Error code: INVALID_TAG, DUPLICATE_ID, CREATE_FAILED, etc."
+						"title": "Doctype",
+						"default": "ground-truth-item"
 					},
-					"message": {
+					"schemaVersion": {
 						"type": "string",
-						"title": "Message",
-						"description": "Human-readable error description"
-					}
-				},
-				"type": "object",
-				"required": ["index", "code", "message"],
-				"title": "BulkImportError",
-				"description": "Structured error for bulk import failures."
-			},
-			"ChatReference": {
-				"properties": {
-					"id": {
+						"title": "Schemaversion",
+						"default": "v2"
+					},
+					"manualTags": {
+						"items": {
+							"type": "string"
+						},
+						"type": "array",
+						"title": "Manualtags"
+					},
+					"computedTags": {
+						"items": {
+							"type": "string"
+						},
+						"type": "array",
+						"title": "Computedtags"
+					},
+					"comment": {
+						"type": "string",
+						"title": "Comment",
+						"default": ""
+					},
+					"assignedTo": {
 						"anyOf": [
 							{
 								"type": "string"
@@ -1787,20 +1833,26 @@
 								"type": "null"
 							}
 						],
-						"title": "Id"
+						"title": "Assignedto"
 					},
-					"title": {
+					"assignedAt": {
 						"anyOf": [
 							{
-								"type": "string"
+								"type": "string",
+								"format": "date-time"
 							},
 							{
 								"type": "null"
 							}
 						],
-						"title": "Title"
+						"title": "Assignedat"
 					},
-					"url": {
+					"updatedAt": {
+						"type": "string",
+						"format": "date-time",
+						"title": "Updatedat"
+					},
+					"updatedBy": {
 						"anyOf": [
 							{
 								"type": "string"
@@ -1809,20 +1861,21 @@
 								"type": "null"
 							}
 						],
-						"title": "Url"
+						"title": "Updatedby"
 					},
-					"snippet": {
+					"reviewedAt": {
 						"anyOf": [
 							{
-								"type": "string"
+								"type": "string",
+								"format": "date-time"
 							},
 							{
 								"type": "null"
 							}
 						],
-						"title": "Snippet"
+						"title": "Reviewedat"
 					},
-					"keyParagraph": {
+					"_etag": {
 						"anyOf": [
 							{
 								"type": "string"
@@ -1831,61 +1884,71 @@
 								"type": "null"
 							}
 						],
-						"title": "Keyparagraph"
-					}
-				},
-				"additionalProperties": true,
-				"type": "object",
-				"title": "ChatReference"
-			},
-			"ChatRequest": {
-				"properties": {
-					"message": {
+						"title": "Etag"
+					},
+					"scenarioId": {
 						"type": "string",
-						"minLength": 1,
-						"title": "Message"
+						"title": "Scenarioid",
+						"default": ""
+					},
+					"history": {
+						"items": {
+							"$ref": "#/components/schemas/HistoryEntry"
+						},
+						"type": "array",
+						"title": "History"
+					},
+					"contextEntries": {
+						"items": {
+							"$ref": "#/components/schemas/ContextEntry"
+						},
+						"type": "array",
+						"title": "Contextentries"
 					},
-					"context": {
+					"traceIds": {
 						"anyOf": [
 							{
-								"type": "string"
+								"additionalProperties": {
+									"type": "string"
+								},
+								"type": "object"
 							},
 							{
 								"type": "null"
 							}
 						],
-						"title": "Context"
-					}
-				},
-				"type": "object",
-				"required": ["message"],
-				"title": "ChatRequest"
-			},
-			"ChatResponse": {
-				"properties": {
-					"content": {
-						"type": "string",
-						"title": "Content"
+						"title": "Traceids"
 					},
-					"references": {
+					"toolCalls": {
 						"items": {
-							"$ref": "#/components/schemas/ChatReference"
+							"$ref": "#/components/schemas/ToolCallRecord"
 						},
 						"type": "array",
-						"title": "References"
-					}
-				},
-				"type": "object",
-				"required": ["content"],
-				"title": "ChatResponse"
-			},
-			"CurationInstructionsUpdate": {
-				"properties": {
-					"instructions": {
-						"type": "string",
-						"title": "Instructions"
+						"title": "Toolcalls"
 					},
-					"_etag": {
+					"expectedTools": {
+						"$ref": "#/components/schemas/ExpectedTools"
+					},
+					"feedback": {
+						"items": {
+							"$ref": "#/components/schemas/FeedbackEntry"
+						},
+						"type": "array",
+						"title": "Feedback"
+					},
+					"metadata": {
+						"additionalProperties": true,
+						"type": "object",
+						"title": "Metadata"
+					},
+					"plugins": {
+						"additionalProperties": {
+							"$ref": "#/components/schemas/PluginPayload"
+						},
+						"type": "object",
+						"title": "Plugins"
+					},
+					"createdBy": {
 						"anyOf": [
 							{
 								"type": "string"
@@ -1894,48 +1957,34 @@
 								"type": "null"
 							}
 						],
-						"title": "Etag"
-					}
-				},
-				"type": "object",
-				"required": ["instructions"],
-				"title": "CurationInstructionsUpdate"
-			},
-			"DatasetCurationInstructions": {
-				"properties": {
-					"id": {
-						"type": "string",
-						"title": "Id"
-					},
-					"datasetName": {
-						"type": "string",
-						"title": "Datasetname"
-					},
-					"bucket": {
-						"type": "string",
-						"format": "uuid",
-						"title": "Bucket"
-					},
-					"docType": {
-						"type": "string",
-						"title": "Doctype",
-						"default": "curation-instructions"
+						"title": "Createdby"
 					},
-					"schemaVersion": {
-						"type": "string",
-						"title": "Schemaversion",
-						"default": "v1"
+					"createdAt": {
+						"anyOf": [
+							{
+								"type": "string",
+								"format": "date-time"
+							},
+							{
+								"type": "null"
+							}
+						],
+						"title": "Createdat"
 					},
-					"instructions": {
-						"type": "string",
-						"title": "Instructions"
+					"tracePayload": {
+						"additionalProperties": true,
+						"type": "object",
+						"title": "Tracepayload"
 					},
-					"updatedAt": {
-						"type": "string",
-						"format": "date-time",
-						"title": "Updatedat"
+					"tags": {
+						"items": {
+							"type": "string"
+						},
+						"type": "array",
+						"title": "Tags",
+						"readOnly": true
 					},
-					"updatedBy": {
+					"synthQuestion": {
 						"anyOf": [
 							{
 								"type": "string"
@@ -1944,9 +1993,10 @@
 								"type": "null"
 							}
 						],
-						"title": "Updatedby"
+						"title": "Synthquestion",
+						"readOnly": true
 					},
-					"_etag": {
+					"editedQuestion": {
 						"anyOf": [
 							{
 								"type": "string"
@@ -1955,96 +2005,88 @@
 								"type": "null"
 							}
 						],
-						"title": "Etag"
-					}
-				},
-				"type": "object",
-				"required": ["id", "datasetName", "instructions"],
-				"title": "DatasetCurationInstructions",
-				"description": "Dataset-level curation instructions document (schemaVersion v1).\n\nStored in the same Cosmos container as ground-truth items using MultiHash PK\n[/datasetName, /bucket] with bucket fixed to 0 and a stable id pattern\n\"curation-instructions|{datasetName}\"."
-			},
-			"DependencyDTO": {
-				"properties": {
-					"group": {
-						"type": "string",
-						"title": "Group"
-					},
-					"value": {
-						"type": "string",
-						"title": "Value"
-					}
-				},
-				"type": "object",
-				"required": ["group", "value"],
-				"title": "DependencyDTO"
-			},
-			"DuplicateWarning": {
-				"properties": {
-					"itemId": {
-						"type": "string",
-						"title": "Itemid",
-						"description": "Draft item identifier"
-					},
-					"duplicateId": {
-						"type": "string",
-						"title": "Duplicateid",
-						"description": "ID of the likely duplicate approved item"
+						"title": "Editedquestion",
+						"readOnly": true
 					},
-					"duplicateQuestion": {
-						"type": "string",
-						"title": "Duplicatequestion",
-						"description": "Question text from the duplicate item"
+					"answer": {
+						"anyOf": [
+							{
+								"type": "string"
+							},
+							{
+								"type": "null"
+							}
+						],
+						"title": "Answer",
+						"readOnly": true
 					},
-					"duplicateStatus": {
-						"type": "string",
-						"title": "Duplicatestatus",
-						"description": "Status of the duplicate item"
+					"refs": {
+						"items": {
+							"$ref": "#/components/schemas/Reference"
+						},
+						"type": "array",
+						"title": "Refs",
+						"readOnly": true
 					},
-					"matchReason": {
-						"type": "string",
-						"title": "Matchreason",
-						"description": "Why this was flagged as a duplicate (e.g., 'exact question match', 'question and answer match')"
+					"totalReferences": {
+						"type": "integer",
+						"title": "Totalreferences",
+						"readOnly": true
 					}
 				},
+				"additionalProperties": false,
 				"type": "object",
 				"required": [
-					"itemId",
-					"duplicateId",
-					"duplicateQuestion",
-					"duplicateStatus",
-					"matchReason"
-				],
-				"title": "DuplicateWarning",
-				"description": "Warning about a likely duplicate of an approved item."
-			},
-			"ExpectedBehavior": {
-				"type": "string",
-				"enum": [
-					"tool:search",
-					"generation:answer",
-					"generation:need-context",
-					"generation:clarification",
-					"generation:out-of-domain"
+					"id",
+					"datasetName",
+					"tags",
+					"synthQuestion",
+					"editedQuestion",
+					"answer",
+					"refs",
+					"totalReferences"
 				],
-				"title": "ExpectedBehavior",
-				"description": "Expected behavior tags for history items in ground truth evaluation.\n\nThese tags describe what the agent should do at each turn of a conversation:\n- tool:search: Agent should perform a search/retrieval operation\n- generation:answer: Agent should generate a direct answer\n- generation:need-context: Agent should ask for more context\n- generation:clarification: Agent should ask for clarification\n- generation:out-of-domain: Agent should indicate the query is out of domain"
+				"title": "AgenticGroundTruthEntry",
+				"description": "Generic agentic-first host model.\n\nThe core contract intentionally exposes only the generic schema in OpenAPI. Legacy\nRAG-shaped payloads are translated into this shape when validating this base class so\nexisting data can be carried forward without remaining top-level contract fields."
 			},
-			"ExportDeliveryOptions": {
+			"AssignItemRequest": {
 				"properties": {
-					"mode": {
-						"type": "string",
-						"enum": ["attachment", "artifact"],
-						"title": "Mode",
-						"default": "artifact"
+					"force": {
+						"type": "boolean",
+						"title": "Force",
+						"description": "Force assignment even if item is assigned to another user (requires admin or team-lead role)",
+						"default": false
 					}
 				},
-				"additionalProperties": false,
 				"type": "object",
-				"title": "ExportDeliveryOptions"
+				"title": "AssignItemRequest",
+				"description": "Request body for assignment endpoint."
 			},
-			"ExportFilters": {
+			"AssignmentUpdateRequest": {
 				"properties": {
-					"datasetNames": {
+					"comment": {
+						"anyOf": [
+							{
+								"type": "string"
+							},
+							{
+								"type": "null"
+							}
+						],
+						"title": "Comment"
+					},
+					"status": {
+						"anyOf": [
+							{
+								"$ref": "#/components/schemas/GroundTruthStatus"
+							},
+							{
+								"type": "string"
+							}
+						],
+						"title": "Status"
+					},
+					"manualTags": {
 						"anyOf": [
 							{
 								"items": {
@@ -2056,110 +2098,20 @@
 								"type": "null"
 							}
 						],
-						"title": "Datasetnames"
+						"title": "Manualtags"
 					},
-					"status": {
-						"type": "string",
-						"title": "Status",
-						"default": "approved"
-					}
-				},
-				"additionalProperties": false,
-				"type": "object",
-				"title": "ExportFilters"
-			},
-			"FrontendConfig": {
-				"properties": {
-					"requireReferenceVisit": {
-						"type": "boolean",
-						"title": "Requirereferencevisit"
-					},
-					"requireKeyParagraph": {
-						"type": "boolean",
-						"title": "Requirekeyparagraph"
-					},
-					"selfServeLimit": {
-						"type": "integer",
-						"title": "Selfservelimit"
-					},
-					"trustedReferenceDomains": {
-						"items": {
-							"type": "string"
-						},
-						"type": "array",
-						"title": "Trustedreferencedomains"
-					}
-				},
-				"type": "object",
-				"required": [
-					"requireReferenceVisit",
-					"requireKeyParagraph",
-					"selfServeLimit",
-					"trustedReferenceDomains"
-				],
-				"title": "FrontendConfig",
-				"description": "Frontend runtime configuration."
-			},
-			"GlossaryResponse": {
-				"properties": {
-					"version": {
-						"type": "string",
-						"title": "Version",
-						"default": "v1"
-					},
-					"groups": {
-						"items": {
-							"$ref": "#/components/schemas/TagGroupGlossaryDTO"
-						},
-						"type": "array",
-						"title": "Groups"
-					}
-				},
-				"type": "object",
-				"required": ["groups"],
-				"title": "GlossaryResponse"
-			},
-			"GroundTruthItem-Input": {
-				"properties": {
-					"id": {
-						"type": "string",
-						"title": "Id"
-					},
-					"datasetName": {
-						"type": "string",
-						"title": "Datasetname"
-					},
-					"bucket": {
+					"approve": {
 						"anyOf": [
 							{
-								"type": "string",
-								"format": "uuid"
+								"type": "boolean"
 							},
 							{
 								"type": "null"
 							}
 						],
-						"title": "Bucket"
-					},
-					"status": {
-						"$ref": "#/components/schemas/GroundTruthStatus",
-						"default": "draft"
-					},
-					"docType": {
-						"type": "string",
-						"title": "Doctype",
-						"default": "ground-truth-item"
-					},
-					"schemaVersion": {
-						"type": "string",
-						"title": "Schemaversion",
-						"default": "v2"
-					},
-					"synthQuestion": {
-						"type": "string",
-						"title": "Synthquestion"
+						"title": "Approve"
 					},
-					"editedQuestion": {
+					"etag": {
 						"anyOf": [
 							{
 								"type": "string"
@@ -2168,56 +2120,41 @@
 								"type": "null"
 							}
 						],
-						"title": "Editedquestion"
+						"title": "Etag"
 					},
-					"answer": {
+					"history": {
 						"anyOf": [
 							{
-								"type": "string"
+								"items": {
+									"$ref": "#/components/schemas/HistoryEntryPatch"
+								},
+								"type": "array"
 							},
 							{
 								"type": "null"
 							}
 						],
-						"title": "Answer"
-					},
-					"refs": {
-						"items": {
-							"$ref": "#/components/schemas/Reference"
-						},
-						"type": "array",
-						"title": "Refs"
-					},
-					"manualTags": {
-						"items": {
-							"type": "string"
-						},
-						"type": "array",
-						"title": "Manualtags"
-					},
-					"computedTags": {
-						"items": {
-							"type": "string"
-						},
-						"type": "array",
-						"title": "Computedtags"
+						"title": "History"
 					},
-					"comment": {
+					"contextEntries": {
 						"anyOf": [
 							{
-								"type": "string"
+								"items": {
+									"$ref": "#/components/schemas/ContextEntry"
+								},
+								"type": "array"
 							},
 							{
 								"type": "null"
 							}
 						],
-						"title": "Comment"
+						"title": "Contextentries"
 					},
-					"history": {
+					"toolCalls": {
 						"anyOf": [
 							{
 								"items": {
-									"$ref": "#/components/schemas/HistoryItem"
+									"$ref": "#/components/schemas/ToolCallRecord"
 								},
 								"type": "array"
 							},
@@ -2225,86 +2162,114 @@
 								"type": "null"
 							}
 						],
-						"title": "History"
+						"title": "Toolcalls"
+					},
+					"expectedTools": {
+						"$ref": "#/components/schemas/ExpectedTools",
+						"title": "Expectedtools"
 					},
-					"contextUsedForGeneration": {
+					"feedback": {
 						"anyOf": [
 							{
-								"type": "string"
+								"items": {
+									"$ref": "#/components/schemas/FeedbackEntry"
+								},
+								"type": "array"
 							},
 							{
 								"type": "null"
 							}
 						],
-						"title": "Contextusedforgeneration"
+						"title": "Feedback"
 					},
-					"contextSource": {
+					"metadata": {
 						"anyOf": [
 							{
-								"type": "string"
+								"additionalProperties": true,
+								"type": "object"
 							},
 							{
 								"type": "null"
 							}
 						],
-						"title": "Contextsource"
+						"title": "Metadata"
 					},
-					"modelUsedForGeneration": {
+					"plugins": {
 						"anyOf": [
 							{
-								"type": "string"
+								"additionalProperties": {
+									"$ref": "#/components/schemas/PluginPayload"
+								},
+								"type": "object"
 							},
 							{
 								"type": "null"
 							}
 						],
-						"title": "Modelusedforgeneration"
+						"title": "Plugins"
 					},
-					"semanticClusterNumber": {
+					"traceIds": {
 						"anyOf": [
 							{
-								"type": "integer"
+								"additionalProperties": {
+									"type": "string"
+								},
+								"type": "object"
 							},
 							{
 								"type": "null"
 							}
 						],
-						"title": "Semanticclusternumber"
+						"title": "Traceids"
 					},
-					"weight": {
+					"tracePayload": {
 						"anyOf": [
 							{
-								"type": "number"
+								"additionalProperties": true,
+								"type": "object"
 							},
 							{
 								"type": "null"
 							}
 						],
-						"title": "Weight"
+						"title": "Tracepayload"
 					},
-					"samplingBucket": {
+					"scenarioId": {
 						"anyOf": [
 							{
-								"type": "integer"
+								"type": "string"
 							},
 							{
 								"type": "null"
 							}
 						],
-						"title": "Samplingbucket"
+						"title": "Scenarioid"
+					}
+				},
+				"additionalProperties": true,
+				"type": "object",
+				"title": "AssignmentUpdateRequest"
+			},
+			"BulkImportError": {
+				"properties": {
+					"index": {
+						"type": "integer",
+						"title": "Index",
+						"description": "0-based position in request array"
 					},
-					"questionLength": {
+					"itemId": {
 						"anyOf": [
 							{
-								"type": "integer"
+								"type": "string"
 							},
 							{
 								"type": "null"
 							}
 						],
-						"title": "Questionlength"
+						"title": "Itemid",
+						"description": "ID of the failed item (if available)"
 					},
-					"assignedTo": {
+					"field": {
 						"anyOf": [
 							{
 								"type": "string"
@@ -2313,19 +2278,89 @@
 								"type": "null"
 							}
 						],
-						"title": "Assignedto"
+						"title": "Field",
+						"description": "Field that caused the error (if applicable)"
 					},
-					"assignedAt": {
+					"code": {
+						"type": "string",
+						"title": "Code",
+						"description": "Error code: INVALID_TAG, DUPLICATE_ID, CREATE_FAILED, etc."
+					},
+					"message": {
+						"type": "string",
+						"title": "Message",
+						"description": "Human-readable error description"
+					}
+				},
+				"type": "object",
+				"required": ["index", "code", "message"],
+				"title": "BulkImportError"
+			},
+			"ContextEntry": {
+				"properties": {
+					"key": {
+						"type": "string",
+						"title": "Key"
+					},
+					"value": {
+						"title": "Value"
+					}
+				},
+				"additionalProperties": false,
+				"type": "object",
+				"required": ["key", "value"],
+				"title": "ContextEntry"
+			},
+			"CurationInstructionsUpdate": {
+				"properties": {
+					"instructions": {
+						"type": "string",
+						"title": "Instructions"
+					},
+					"_etag": {
 						"anyOf": [
 							{
-								"type": "string",
-								"format": "date-time"
+								"type": "string"
 							},
 							{
 								"type": "null"
 							}
 						],
-						"title": "Assignedat"
+						"title": "Etag"
+					}
+				},
+				"type": "object",
+				"required": ["instructions"],
+				"title": "CurationInstructionsUpdate"
+			},
+			"DatasetCurationInstructions": {
+				"properties": {
+					"id": {
+						"type": "string",
+						"title": "Id"
+					},
+					"datasetName": {
+						"type": "string",
+						"title": "Datasetname"
+					},
+					"bucket": {
+						"type": "string",
+						"format": "uuid",
+						"title": "Bucket"
+					},
+					"docType": {
+						"type": "string",
+						"title": "Doctype",
+						"default": "curation-instructions"
+					},
+					"schemaVersion": {
+						"type": "string",
+						"title": "Schemaversion",
+						"default": "v1"
+					},
+					"instructions": {
+						"type": "string",
+						"title": "Instructions"
 					},
 					"updatedAt": {
 						"type": "string",
@@ -2343,18 +2378,6 @@
 						],
 						"title": "Updatedby"
 					},
-					"reviewedAt": {
-						"anyOf": [
-							{
-								"type": "string",
-								"format": "date-time"
-							},
-							{
-								"type": "null"
-							}
-						],
-						"title": "Reviewedat"
-					},
 					"_etag": {
 						"anyOf": [
 							{
@@ -2365,100 +2388,236 @@
 							}
 						],
 						"title": "Etag"
+					}
+				},
+				"type": "object",
+				"required": ["id", "datasetName", "instructions"],
+				"title": "DatasetCurationInstructions"
+			},
+			"DependencyDTO": {
+				"properties": {
+					"group": {
+						"type": "string",
+						"title": "Group"
 					},
-					"totalReferences": {
-						"type": "integer",
-						"title": "Totalreferences",
-						"default": 0
+					"value": {
+						"type": "string",
+						"title": "Value"
 					}
 				},
 				"type": "object",
-				"required": ["id", "datasetName", "synthQuestion"],
-				"title": "GroundTruthItem",
-				"description": "Canonical Ground Truth item aligned to wire schema (schemaVersion v1).\n\nAll fields with camelCase wire names use aliases; we accept both field names and aliases\non input (populate_by_name=True) and always serialize using by_alias."
+				"required": ["group", "value"],
+				"title": "DependencyDTO"
 			},
-			"GroundTruthItem-Output": {
+			"DuplicateWarning": {
 				"properties": {
-					"id": {
+					"itemId": {
 						"type": "string",
-						"title": "Id"
+						"title": "Itemid",
+						"description": "Draft item identifier"
 					},
-					"datasetName": {
+					"duplicateId": {
 						"type": "string",
-						"title": "Datasetname"
+						"title": "Duplicateid",
+						"description": "ID of the likely duplicate approved item"
 					},
-					"bucket": {
+					"duplicateQuestion": {
+						"type": "string",
+						"title": "Duplicatequestion",
+						"description": "Question text from the duplicate item"
+					},
+					"duplicateStatus": {
+						"type": "string",
+						"title": "Duplicatestatus",
+						"description": "Status of the duplicate item"
+					},
+					"matchReason": {
+						"type": "string",
+						"title": "Matchreason",
+						"description": "Why this was flagged as a duplicate (e.g., 'exact question match', 'question and answer match')"
+					}
+				},
+				"type": "object",
+				"required": [
+					"itemId",
+					"duplicateId",
+					"duplicateQuestion",
+					"duplicateStatus",
+					"matchReason"
+				],
+				"title": "DuplicateWarning",
+				"description": "Warning about a likely duplicate of an approved item."
+			},
+			"ExpectedTools": {
+				"properties": {
+					"required": {
+						"items": {
+							"$ref": "#/components/schemas/ToolExpectation"
+						},
+						"type": "array",
+						"title": "Required"
+					},
+					"optional": {
+						"items": {
+							"$ref": "#/components/schemas/ToolExpectation"
+						},
+						"type": "array",
+						"title": "Optional"
+					},
+					"notNeeded": {
+						"items": {
+							"$ref": "#/components/schemas/ToolExpectation"
+						},
+						"type": "array",
+						"title": "Notneeded"
+					}
+				},
+				"additionalProperties": false,
+				"type": "object",
+				"title": "ExpectedTools",
+				"description": "Tool expectations. Tools are implicitly allowed unless listed here."
+			},
+			"ExportDeliveryOptions": {
+				"properties": {
+					"mode": {
+						"type": "string",
+						"enum": ["attachment", "artifact"],
+						"title": "Mode",
+						"default": "artifact"
+					}
+				},
+				"additionalProperties": false,
+				"type": "object",
+				"title": "ExportDeliveryOptions"
+			},
+			"ExportFilters": {
+				"properties": {
+					"datasetNames": {
 						"anyOf": [
 							{
-								"type": "string",
-								"format": "uuid"
+								"items": {
+									"type": "string"
+								},
+								"type": "array"
 							},
 							{
 								"type": "null"
 							}
 						],
-						"title": "Bucket"
+						"title": "Datasetnames"
 					},
 					"status": {
-						"$ref": "#/components/schemas/GroundTruthStatus",
-						"default": "draft"
-					},
-					"docType": {
 						"type": "string",
-						"title": "Doctype",
-						"default": "ground-truth-item"
-					},
-					"schemaVersion": {
+						"title": "Status",
+						"default": "approved"
+					}
+				},
+				"additionalProperties": false,
+				"type": "object",
+				"title": "ExportFilters"
+			},
+			"FeedbackEntry": {
+				"properties": {
+					"source": {
 						"type": "string",
-						"title": "Schemaversion",
-						"default": "v2"
+						"title": "Source",
+						"default": ""
 					},
-					"synthQuestion": {
+					"values": {
+						"additionalProperties": true,
+						"type": "object",
+						"title": "Values"
+					}
+				},
+				"additionalProperties": false,
+				"type": "object",
+				"title": "FeedbackEntry"
+			},
+			"FrontendConfig": {
+				"properties": {
+					"requireReferenceVisit": {
+						"type": "boolean",
+						"title": "Requirereferencevisit"
+					},
+					"requireKeyParagraph": {
+						"type": "boolean",
+						"title": "Requirekeyparagraph"
+					},
+					"selfServeLimit": {
+						"type": "integer",
+						"title": "Selfservelimit"
+					},
+					"trustedReferenceDomains": {
+						"items": {
+							"type": "string"
+						},
+						"type": "array",
+						"title": "Trustedreferencedomains"
+					}
+				},
+				"type": "object",
+				"required": [
+					"requireReferenceVisit",
+					"requireKeyParagraph",
+					"selfServeLimit",
+					"trustedReferenceDomains"
+				],
+				"title": "FrontendConfig",
+				"description": "Frontend runtime configuration."
+			},
+			"GlossaryResponse": {
+				"properties": {
+					"version": {
 						"type": "string",
-						"title": "Synthquestion"
+						"title": "Version",
+						"default": "v1"
 					},
-					"editedQuestion": {
+					"groups": {
+						"items": {
+							"$ref": "#/components/schemas/TagGroupGlossaryDTO"
+						},
+						"type": "array",
+						"title": "Groups"
+					}
+				},
+				"type": "object",
+				"required": ["groups"],
+				"title": "GlossaryResponse"
+			},
+			"GroundTruthListResponse": {
+				"properties": {
+					"items": {
+						"items": {
+							"$ref": "#/components/schemas/AgenticGroundTruthEntry-Output"
+						},
+						"type": "array",
+						"title": "Items"
+					},
+					"pagination": {
+						"$ref": "#/components/schemas/PaginationMetadata"
+					}
+				},
+				"type": "object",
+				"required": ["items", "pagination"],
+				"title": "GroundTruthListResponse"
+			},
+			"GroundTruthStatus": {
+				"type": "string",
+				"enum": ["draft", "approved", "deleted", "skipped"],
+				"title": "GroundTruthStatus"
+			},
+			"GroundTruthUpdateRequest": {
+				"properties": {
+					"status": {
 						"anyOf": [
 							{
-								"type": "string"
+								"$ref": "#/components/schemas/GroundTruthStatus"
 							},
-							{
-								"type": "null"
-							}
-						],
-						"title": "Editedquestion"
-					},
-					"answer": {
-						"anyOf": [
 							{
 								"type": "string"
-							},
-							{
-								"type": "null"
 							}
 						],
-						"title": "Answer"
-					},
-					"refs": {
-						"items": {
-							"$ref": "#/components/schemas/Reference"
-						},
-						"type": "array",
-						"title": "Refs"
-					},
-					"manualTags": {
-						"items": {
-							"type": "string"
-						},
-						"type": "array",
-						"title": "Manualtags"
-					},
-					"computedTags": {
-						"items": {
-							"type": "string"
-						},
-						"type": "array",
-						"title": "Computedtags"
+						"title": "Status"
 					},
 					"comment": {
 						"anyOf": [
@@ -2475,7 +2634,7 @@
 						"anyOf": [
 							{
 								"items": {
-									"$ref": "#/components/schemas/HistoryItem"
+									"$ref": "#/components/schemas/HistoryEntryPatch"
 								},
 								"type": "array"
 							},
@@ -2485,112 +2644,119 @@
 						],
 						"title": "History"
 					},
-					"contextUsedForGeneration": {
+					"contextEntries": {
 						"anyOf": [
 							{
-								"type": "string"
+								"items": {
+									"$ref": "#/components/schemas/ContextEntry"
+								},
+								"type": "array"
 							},
 							{
 								"type": "null"
 							}
 						],
-						"title": "Contextusedforgeneration"
+						"title": "Contextentries"
 					},
-					"contextSource": {
+					"toolCalls": {
 						"anyOf": [
 							{
-								"type": "string"
+								"items": {
+									"$ref": "#/components/schemas/ToolCallRecord"
+								},
+								"type": "array"
 							},
 							{
 								"type": "null"
 							}
 						],
-						"title": "Contextsource"
+						"title": "Toolcalls"
 					},
-					"modelUsedForGeneration": {
-						"anyOf": [
-							{
-								"type": "string"
-							},
-							{
-								"type": "null"
-							}
-						],
-						"title": "Modelusedforgeneration"
+					"expectedTools": {
+						"$ref": "#/components/schemas/ExpectedTools",
+						"title": "Expectedtools"
 					},
-					"semanticClusterNumber": {
+					"feedback": {
 						"anyOf": [
 							{
-								"type": "integer"
+								"items": {
+									"$ref": "#/components/schemas/FeedbackEntry"
+								},
+								"type": "array"
 							},
 							{
 								"type": "null"
 							}
 						],
-						"title": "Semanticclusternumber"
+						"title": "Feedback"
 					},
-					"weight": {
+					"metadata": {
 						"anyOf": [
 							{
-								"type": "number"
+								"additionalProperties": true,
+								"type": "object"
 							},
 							{
 								"type": "null"
 							}
 						],
-						"title": "Weight"
+						"title": "Metadata"
 					},
-					"samplingBucket": {
+					"plugins": {
 						"anyOf": [
 							{
-								"type": "integer"
+								"additionalProperties": {
+									"$ref": "#/components/schemas/PluginPayload"
+								},
+								"type": "object"
 							},
 							{
 								"type": "null"
 							}
 						],
-						"title": "Samplingbucket"
+						"title": "Plugins"
 					},
-					"questionLength": {
+					"manualTags": {
 						"anyOf": [
 							{
-								"type": "integer"
+								"items": {
+									"type": "string"
+								},
+								"type": "array"
 							},
 							{
 								"type": "null"
 							}
 						],
-						"title": "Questionlength"
+						"title": "Manualtags"
 					},
-					"assignedTo": {
+					"traceIds": {
 						"anyOf": [
 							{
-								"type": "string"
+								"additionalProperties": {
+									"type": "string"
+								},
+								"type": "object"
 							},
 							{
 								"type": "null"
 							}
 						],
-						"title": "Assignedto"
+						"title": "Traceids"
 					},
-					"assignedAt": {
+					"tracePayload": {
 						"anyOf": [
 							{
-								"type": "string",
-								"format": "date-time"
+								"additionalProperties": true,
+								"type": "object"
 							},
 							{
 								"type": "null"
 							}
 						],
-						"title": "Assignedat"
-					},
-					"updatedAt": {
-						"type": "string",
-						"format": "date-time",
-						"title": "Updatedat"
+						"title": "Tracepayload"
 					},
-					"updatedBy": {
+					"scenarioId": {
 						"anyOf": [
 							{
 								"type": "string"
@@ -2599,21 +2765,9 @@
 								"type": "null"
 							}
 						],
-						"title": "Updatedby"
-					},
-					"reviewedAt": {
-						"anyOf": [
-							{
-								"type": "string",
-								"format": "date-time"
-							},
-							{
-								"type": "null"
-							}
-						],
-						"title": "Reviewedat"
+						"title": "Scenarioid"
 					},
-					"_etag": {
+					"etag": {
 						"anyOf": [
 							{
 								"type": "string"
@@ -2623,48 +2777,11 @@
 							}
 						],
 						"title": "Etag"
-					},
-					"totalReferences": {
-						"type": "integer",
-						"title": "Totalreferences",
-						"default": 0
-					},
-					"tags": {
-						"items": {
-							"type": "string"
-						},
-						"type": "array",
-						"title": "Tags",
-						"description": "Return a merged, sorted view of manual and computed tags.",
-						"readOnly": true
-					}
-				},
-				"type": "object",
-				"required": ["id", "datasetName", "synthQuestion", "tags"],
-				"title": "GroundTruthItem",
-				"description": "Canonical Ground Truth item aligned to wire schema (schemaVersion v1).\n\nAll fields with camelCase wire names use aliases; we accept both field names and aliases\non input (populate_by_name=True) and always serialize using by_alias."
-			},
-			"GroundTruthListResponse": {
-				"properties": {
-					"items": {
-						"items": {
-							"$ref": "#/components/schemas/GroundTruthItem-Output"
-						},
-						"type": "array",
-						"title": "Items"
-					},
-					"pagination": {
-						"$ref": "#/components/schemas/PaginationMetadata"
 					}
 				},
+				"additionalProperties": true,
 				"type": "object",
-				"required": ["items", "pagination"],
-				"title": "GroundTruthListResponse"
-			},
-			"GroundTruthStatus": {
-				"type": "string",
-				"enum": ["draft", "approved", "deleted", "skipped"],
-				"title": "GroundTruthStatus"
+				"title": "GroundTruthUpdateRequest"
 			},
 			"HTTPValidationError": {
 				"properties": {
@@ -2679,54 +2796,44 @@
 				"type": "object",
 				"title": "HTTPValidationError"
 			},
-			"HistoryItem": {
+			"HistoryEntry": {
 				"properties": {
 					"role": {
-						"$ref": "#/components/schemas/HistoryItemRole"
+						"type": "string",
+						"title": "Role"
 					},
 					"msg": {
 						"type": "string",
 						"title": "Msg"
+					}
+				},
+				"additionalProperties": false,
+				"type": "object",
+				"required": ["role", "msg"],
+				"title": "HistoryEntry"
+			},
+			"HistoryEntryPatch": {
+				"properties": {
+					"role": {
+						"type": "string",
+						"title": "Role"
 					},
-					"refs": {
-						"anyOf": [
-							{
-								"items": {
-									"$ref": "#/components/schemas/Reference"
-								},
-								"type": "array"
-							},
-							{
-								"type": "null"
-							}
-						],
-						"title": "Refs"
-					},
-					"expectedBehavior": {
+					"msg": {
 						"anyOf": [
 							{
-								"items": {
-									"$ref": "#/components/schemas/ExpectedBehavior"
-								},
-								"type": "array"
+								"type": "string"
 							},
 							{
 								"type": "null"
 							}
 						],
-						"title": "Expectedbehavior",
-						"description": "Expected behavior(s) for this turn in the conversation (e.g., tool:search, generation:answer)"
+						"title": "Msg"
 					}
 				},
+				"additionalProperties": true,
 				"type": "object",
-				"required": ["role", "msg"],
-				"title": "HistoryItem",
-				"description": "Represents a single item in the multi-turn history."
-			},
-			"HistoryItemRole": {
-				"type": "string",
-				"enum": ["user", "assistant"],
-				"title": "HistoryItemRole"
+				"required": ["role"],
+				"title": "HistoryEntryPatch"
 			},
 			"ImportBulkResponse": {
 				"properties": {
@@ -2857,8 +2964,29 @@
 					"hasNext",
 					"hasPrev"
 				],
-				"title": "PaginationMetadata",
-				"description": "Pagination metadata for list responses."
+				"title": "PaginationMetadata"
+			},
+			"PluginPayload": {
+				"properties": {
+					"kind": {
+						"type": "string",
+						"title": "Kind"
+					},
+					"version": {
+						"type": "string",
+						"title": "Version",
+						"default": "1.0"
+					},
+					"data": {
+						"additionalProperties": true,
+						"type": "object",
+						"title": "Data"
+					}
+				},
+				"additionalProperties": false,
+				"type": "object",
+				"required": ["kind"],
+				"title": "PluginPayload"
 			},
 			"RecomputeTagsResponse": {
 				"properties": {
@@ -2985,7 +3113,7 @@
 				"type": "object",
 				"required": ["url"],
 				"title": "Reference",
-				"description": "Wire reference object.\n\n{ url, title, content, keyExcerpt, type, bonus, messageIndex }"
+				"description": "Legacy RAG reference object retained for compatibility helpers and tests."
 			},
 			"RemoveTagsRequest": {
 				"properties": {
@@ -3005,7 +3133,7 @@
 				"properties": {
 					"assigned": {
 						"items": {
-							"$ref": "#/components/schemas/GroundTruthItem-Output"
+							"$ref": "#/components/schemas/AgenticGroundTruthEntry-Output"
 						},
 						"type": "array",
 						"title": "Assigned"
@@ -3269,6 +3397,115 @@
 				"required": ["groups"],
 				"title": "TagSchemaResponse"
 			},
+			"ToolCallRecord": {
+				"properties": {
+					"id": {
+						"type": "string",
+						"title": "Id",
+						"default": ""
+					},
+					"name": {
+						"type": "string",
+						"title": "Name"
+					},
+					"callType": {
+						"type": "string",
+						"enum": ["tool", "subagent"],
+						"title": "Calltype",
+						"default": "tool"
+					},
+					"arguments": {
+						"anyOf": [
+							{
+								"additionalProperties": true,
+								"type": "object"
+							},
+							{
+								"type": "null"
+							}
+						],
+						"title": "Arguments"
+					},
+					"agent": {
+						"anyOf": [
+							{
+								"type": "string"
+							},
+							{
+								"type": "null"
+							}
+						],
+						"title": "Agent"
+					},
+					"stepNumber": {
+						"anyOf": [
+							{
+								"type": "integer"
+							},
+							{
+								"type": "null"
+							}
+						],
+						"title": "Stepnumber"
+					},
+					"parallelGroup": {
+						"anyOf": [
+							{
+								"type": "string"
+							},
+							{
+								"type": "null"
+							}
+						],
+						"title": "Parallelgroup"
+					},
+					"parentCallId": {
+						"anyOf": [
+							{
+								"type": "string"
+							},
+							{
+								"type": "null"
+							}
+						],
+						"title": "Parentcallid"
+					},
+					"response": {
+						"title": "Response"
+					}
+				},
+				"additionalProperties": false,
+				"type": "object",
+				"required": ["name"],
+				"title": "ToolCallRecord"
+			},
+			"ToolExpectation": {
+				"properties": {
+					"name": {
+						"type": "string",
+						"title": "Name"
+					},
+					"arguments": {
+						"anyOf": [
+							{
+								"additionalProperties": true,
+								"type": "object"
+							},
+							{
+								"type": "string"
+							},
+							{
+								"type": "null"
+							}
+						],
+						"title": "Arguments"
+					}
+				},
+				"additionalProperties": false,
+				"type": "object",
+				"required": ["name"],
+				"title": "ToolExpectation"
+			},
 			"ValidationError": {
 				"properties": {
 					"loc": {
@@ -3318,8 +3555,7 @@
 				},
 				"type": "object",
 				"required": ["total", "succeeded", "failed"],
-				"title": "ValidationSummary",
-				"description": "Summary statistics for bulk import."
+				"title": "ValidationSummary"
 			}
 		}
 	}
diff --git a/frontend/src/components/app/QuestionsExplorer.example.tsx b/frontend/src/components/app/QuestionsExplorer.example.tsx
index 271a413..0ca7862 100644
--- a/frontend/src/components/app/QuestionsExplorer.example.tsx
+++ b/frontend/src/components/app/QuestionsExplorer.example.tsx
@@ -15,13 +15,6 @@ const sampleItems: QuestionsExplorerItem[] = [
 		id: "gt-001",
 		question: "What is the capital of France?",
 		answer: "Paris",
-		references: [
-			{
-				id: "ref-1",
-				title: "Geography textbook",
-				url: "https://example.com/geo101",
-			},
-		],
 		status: "approved",
 		providerId: "json",
 		views: 150,
@@ -34,18 +27,6 @@ const sampleItems: QuestionsExplorerItem[] = [
 		id: "gt-002",
 		question: "How does photosynthesis work?",
 		answer: "Photosynthesis converts light energy into chemical energy...",
-		references: [
-			{
-				id: "ref-2",
-				title: "Biology handbook",
-				url: "https://example.com/bio201",
-			},
-			{
-				id: "ref-3",
-				title: "Science journal article",
-				url: "https://example.com/sci-123",
-			},
-		],
 		status: "draft",
 		providerId: "json",
 		views: 45,
@@ -58,23 +39,6 @@ const sampleItems: QuestionsExplorerItem[] = [
 		id: "gt-003",
 		question: "What is quantum computing?",
 		answer: "Quantum computing uses quantum-mechanical phenomena...",
-		references: [
-			{
-				id: "ref-4",
-				title: "Quantum computing primer",
-				url: "https://quantum101.com",
-			},
-			{
-				id: "ref-5",
-				title: "IEEE paper",
-				url: "https://ieeexplore.ieee.org/456",
-			},
-			{
-				id: "ref-6",
-				title: "Research article",
-				url: "https://research.org/789",
-			},
-		],
 		status: "approved",
 		providerId: "json",
 		views: 230,
@@ -87,7 +51,6 @@ const sampleItems: QuestionsExplorerItem[] = [
 		id: "gt-004",
 		question: "Explain machine learning basics",
 		answer: "Machine learning is a subset of artificial intelligence...",
-		references: [],
 		status: "deleted",
 		providerId: "json",
 		views: 89,
@@ -100,20 +63,6 @@ const sampleItems: QuestionsExplorerItem[] = [
 		id: "gt-005",
 		question: "What are the benefits of exercise?",
 		answer: "Regular exercise improves cardiovascular health...",
-		references: [
-			{ id: "ref-7", title: "Health guidelines", url: "https://health.gov" },
-			{
-				id: "ref-8",
-				title: "Medical study",
-				url: "https://example.com/med-321",
-			},
-			{
-				id: "ref-9",
-				title: "Fitness research",
-				url: "https://fitness-research.org",
-			},
-			{ id: "ref-10", title: "WHO report", url: "https://who.int/report" },
-		],
 		status: "approved",
 		providerId: "json",
 		views: 320,
@@ -127,13 +76,6 @@ const sampleItems: QuestionsExplorerItem[] = [
 		question: "How do neural networks work?",
 		answer:
 			"Neural networks are computing systems inspired by biological neural networks...",
-		references: [
-			{
-				id: "ref-11",
-				title: "Deep Learning textbook",
-				url: "https://deeplearning.net",
-			},
-		],
 		status: "draft",
 		providerId: "json",
 		views: 198,
@@ -146,18 +88,6 @@ const sampleItems: QuestionsExplorerItem[] = [
 		id: "gt-007",
 		question: "What is blockchain technology?",
 		answer: "Blockchain is a distributed ledger technology...",
-		references: [
-			{
-				id: "ref-12",
-				title: "Blockchain explained",
-				url: "https://blockchain.com/guide",
-			},
-			{
-				id: "ref-13",
-				title: "Distributed systems paper",
-				url: "https://arxiv.org/abs/12345",
-			},
-		],
 		status: "approved",
 		providerId: "json",
 		views: 412,
@@ -170,7 +100,6 @@ const sampleItems: QuestionsExplorerItem[] = [
 		id: "gt-008",
 		question: "Explain the water cycle",
 		answer: "The water cycle describes the continuous movement of water...",
-		references: [],
 		status: "approved",
 		providerId: "json",
 		views: 276,
@@ -183,29 +112,6 @@ const sampleItems: QuestionsExplorerItem[] = [
 		id: "gt-009",
 		question: "What causes climate change?",
 		answer: "Climate change is primarily caused by greenhouse gas emissions...",
-		references: [
-			{ id: "ref-14", title: "IPCC Report", url: "https://ipcc.ch/report" },
-			{
-				id: "ref-15",
-				title: "Climate science overview",
-				url: "https://climate.gov/science",
-			},
-			{
-				id: "ref-16",
-				title: "NASA climate data",
-				url: "https://climate.nasa.gov",
-			},
-			{
-				id: "ref-17",
-				title: "Environmental study",
-				url: "https://nature.com/climate-123",
-			},
-			{
-				id: "ref-18",
-				title: "Global warming research",
-				url: "https://science.org/warming",
-			},
-		],
 		status: "draft",
 		providerId: "json",
 		views: 523,
@@ -218,18 +124,6 @@ const sampleItems: QuestionsExplorerItem[] = [
 		id: "gt-010",
 		question: "How does DNA replication work?",
 		answer: "DNA replication is the process of copying DNA molecules...",
-		references: [
-			{
-				id: "ref-19",
-				title: "Molecular biology textbook",
-				url: "https://example.com/molbio",
-			},
-			{
-				id: "ref-20",
-				title: "DNA research article",
-				url: "https://pubmed.com/dna-456",
-			},
-		],
 		status: "approved",
 		providerId: "json",
 		views: 145,
@@ -243,23 +137,6 @@ const sampleItems: QuestionsExplorerItem[] = [
 		question: "What is the theory of relativity?",
 		answer:
 			"Einstein's theory of relativity revolutionized our understanding of space and time...",
-		references: [
-			{
-				id: "ref-21",
-				title: "Einstein's papers",
-				url: "https://einsteinpapers.org",
-			},
-			{
-				id: "ref-22",
-				title: "Physics textbook",
-				url: "https://physics.org/relativity",
-			},
-			{
-				id: "ref-23",
-				title: "Space-time explained",
-				url: "https://nasa.gov/spacetime",
-			},
-		],
 		status: "approved",
 		providerId: "json",
 		views: 387,
@@ -273,13 +150,6 @@ const sampleItems: QuestionsExplorerItem[] = [
 		question: "How do vaccines work?",
 		answer:
 			"Vaccines work by training the immune system to recognize pathogens...",
-		references: [
-			{
-				id: "ref-24",
-				title: "CDC vaccine information",
-				url: "https://cdc.gov/vaccines",
-			},
-		],
 		status: "draft",
 		providerId: "json",
 		views: 612,
@@ -291,7 +161,6 @@ const sampleItems: QuestionsExplorerItem[] = [
 		id: "gt-013",
 		question: "What is cloud computing?",
 		answer: "Cloud computing delivers computing services over the internet...",
-		references: [],
 		status: "deleted",
 		providerId: "json",
 		views: 298,
@@ -304,18 +173,6 @@ const sampleItems: QuestionsExplorerItem[] = [
 		question: "Explain Newton's laws of motion",
 		answer:
 			"Newton's three laws describe the relationship between objects and forces...",
-		references: [
-			{
-				id: "ref-25",
-				title: "Classical mechanics",
-				url: "https://physics.org/mechanics",
-			},
-			{
-				id: "ref-26",
-				title: "Newton's Principia",
-				url: "https://principia.org",
-			},
-		],
 		status: "approved",
 		providerId: "json",
 		views: 165,
@@ -327,20 +184,6 @@ const sampleItems: QuestionsExplorerItem[] = [
 		id: "gt-015",
 		question: "What is artificial intelligence?",
 		answer: "AI is the simulation of human intelligence by machines...",
-		references: [
-			{ id: "ref-27", title: "AI overview", url: "https://ai.google" },
-			{
-				id: "ref-28",
-				title: "Machine intelligence paper",
-				url: "https://arxiv.org/ai-789",
-			},
-			{
-				id: "ref-29",
-				title: "Turing test article",
-				url: "https://stanford.edu/turing",
-			},
-			{ id: "ref-30", title: "AI history", url: "https://mit.edu/ai-history" },
-		],
 		status: "draft",
 		providerId: "json",
 		views: 734,
@@ -353,7 +196,6 @@ const sampleItems: QuestionsExplorerItem[] = [
 		question: "How do black holes form?",
 		answer:
 			"Black holes form when massive stars collapse at the end of their life cycle...",
-		references: [],
 		status: "approved",
 		providerId: "json",
 		views: 445,
@@ -365,7 +207,6 @@ const sampleItems: QuestionsExplorerItem[] = [
 		id: "gt-017",
 		question: "What is cryptocurrency?",
 		answer: "Cryptocurrency is a digital currency secured by cryptography...",
-		references: [],
 		status: "draft",
 		providerId: "json",
 		views: 521,
@@ -377,7 +218,6 @@ const sampleItems: QuestionsExplorerItem[] = [
 		id: "gt-018",
 		question: "Explain the concept of entropy",
 		answer: "Entropy is a measure of disorder or randomness in a system...",
-		references: [],
 		status: "approved",
 		providerId: "json",
 		views: 187,
@@ -390,7 +230,6 @@ const sampleItems: QuestionsExplorerItem[] = [
 		question: "What are stem cells?",
 		answer:
 			"Stem cells are undifferentiated cells capable of developing into various cell types...",
-		references: [],
 		status: "draft",
 		providerId: "json",
 		views: 354,
@@ -402,7 +241,6 @@ const sampleItems: QuestionsExplorerItem[] = [
 		id: "gt-020",
 		question: "How does the internet work?",
 		answer: "The internet is a global network of interconnected computers...",
-		references: [],
 		status: "approved",
 		providerId: "json",
 		views: 289,
@@ -415,7 +253,6 @@ const sampleItems: QuestionsExplorerItem[] = [
 		question: "What is natural selection?",
 		answer:
 			"Natural selection is the process by which organisms better adapted survive...",
-		references: [],
 		status: "approved",
 		providerId: "json",
 		views: 423,
@@ -428,7 +265,6 @@ const sampleItems: QuestionsExplorerItem[] = [
 		question: "Explain quantum entanglement",
 		answer:
 			"Quantum entanglement is a phenomenon where particles remain connected...",
-		references: [],
 		status: "draft",
 		providerId: "json",
 		views: 267,
@@ -441,7 +277,6 @@ const sampleItems: QuestionsExplorerItem[] = [
 		question: "What causes earthquakes?",
 		answer:
 			"Earthquakes occur when energy is released from tectonic plate movements...",
-		references: [],
 		status: "deleted",
 		providerId: "json",
 		views: 198,
@@ -454,7 +289,6 @@ const sampleItems: QuestionsExplorerItem[] = [
 		question: "How do solar panels work?",
 		answer:
 			"Solar panels convert sunlight into electricity using photovoltaic cells...",
-		references: [],
 		status: "approved",
 		providerId: "json",
 		views: 512,
@@ -466,7 +300,6 @@ const sampleItems: QuestionsExplorerItem[] = [
 		id: "gt-025",
 		question: "What is gene editing?",
 		answer: "Gene editing allows scientists to modify DNA sequences...",
-		references: [],
 		status: "draft",
 		providerId: "json",
 		views: 389,
@@ -479,7 +312,6 @@ const sampleItems: QuestionsExplorerItem[] = [
 		question: "Explain the greenhouse effect",
 		answer:
 			"The greenhouse effect is the warming of Earth's surface and atmosphere...",
-		references: [],
 		status: "approved",
 		providerId: "json",
 		views: 456,
@@ -492,7 +324,6 @@ const sampleItems: QuestionsExplorerItem[] = [
 		question: "What is machine vision?",
 		answer:
 			"Machine vision enables computers to interpret visual information...",
-		references: [],
 		status: "draft",
 		providerId: "json",
 		views: 312,
@@ -504,7 +335,6 @@ const sampleItems: QuestionsExplorerItem[] = [
 		id: "gt-028",
 		question: "How does GPS work?",
 		answer: "GPS uses satellites to determine precise location on Earth...",
-		references: [],
 		status: "approved",
 		providerId: "json",
 		views: 234,
@@ -517,7 +347,6 @@ const sampleItems: QuestionsExplorerItem[] = [
 		question: "What are exoplanets?",
 		answer:
 			"Exoplanets are planets that orbit stars outside our solar system...",
-		references: [],
 		status: "approved",
 		providerId: "json",
 		views: 378,
@@ -529,7 +358,6 @@ const sampleItems: QuestionsExplorerItem[] = [
 		id: "gt-030",
 		question: "Explain nuclear fusion",
 		answer: "Nuclear fusion is the process that powers the sun...",
-		references: [],
 		status: "draft",
 		providerId: "json",
 		views: 521,
@@ -542,7 +370,6 @@ const sampleItems: QuestionsExplorerItem[] = [
 		question: "What is nanotechnology?",
 		answer:
 			"Nanotechnology involves manipulating matter at the atomic scale...",
-		references: [],
 		status: "approved",
 		providerId: "json",
 		views: 267,
@@ -554,7 +381,6 @@ const sampleItems: QuestionsExplorerItem[] = [
 		id: "gt-032",
 		question: "How do antibiotics work?",
 		answer: "Antibiotics kill or inhibit the growth of bacteria...",
-		references: [],
 		status: "draft",
 		providerId: "json",
 		views: 445,
@@ -567,7 +393,6 @@ const sampleItems: QuestionsExplorerItem[] = [
 		question: "What is dark matter?",
 		answer:
 			"Dark matter is an invisible form of matter that makes up most of the universe...",
-		references: [],
 		status: "approved",
 		providerId: "json",
 		views: 598,
@@ -579,7 +404,6 @@ const sampleItems: QuestionsExplorerItem[] = [
 		id: "gt-034",
 		question: "Explain the Big Bang theory",
 		answer: "The Big Bang theory describes the origin of the universe...",
-		references: [],
 		status: "approved",
 		providerId: "json",
 		views: 487,
@@ -591,7 +415,6 @@ const sampleItems: QuestionsExplorerItem[] = [
 		id: "gt-035",
 		question: "What is cybersecurity?",
 		answer: "Cybersecurity protects computer systems from digital attacks...",
-		references: [],
 		status: "draft",
 		providerId: "json",
 		views: 623,
@@ -603,7 +426,6 @@ const sampleItems: QuestionsExplorerItem[] = [
 		id: "gt-036",
 		question: "How does the human brain work?",
 		answer: "The brain processes information through billions of neurons...",
-		references: [],
 		status: "approved",
 		providerId: "json",
 		views: 712,
@@ -615,7 +437,6 @@ const sampleItems: QuestionsExplorerItem[] = [
 		id: "gt-037",
 		question: "What is 5G technology?",
 		answer: "5G is the fifth generation of cellular network technology...",
-		references: [],
 		status: "draft",
 		providerId: "json",
 		views: 356,
@@ -628,7 +449,6 @@ const sampleItems: QuestionsExplorerItem[] = [
 		question: "Explain plate tectonics",
 		answer:
 			"Plate tectonics describes the movement of Earth's lithospheric plates...",
-		references: [],
 		status: "approved",
 		providerId: "json",
 		views: 298,
@@ -640,7 +460,6 @@ const sampleItems: QuestionsExplorerItem[] = [
 		id: "gt-039",
 		question: "What is renewable energy?",
 		answer: "Renewable energy comes from naturally replenishing sources...",
-		references: [],
 		status: "approved",
 		providerId: "json",
 		views: 534,
@@ -652,7 +471,6 @@ const sampleItems: QuestionsExplorerItem[] = [
 		id: "gt-040",
 		question: "How do batteries work?",
 		answer: "Batteries convert chemical energy into electrical energy...",
-		references: [],
 		status: "draft",
 		providerId: "json",
 		views: 412,
@@ -664,7 +482,6 @@ const sampleItems: QuestionsExplorerItem[] = [
 		id: "gt-041",
 		question: "What is the immune system?",
 		answer: "The immune system defends the body against harmful pathogens...",
-		references: [],
 		status: "approved",
 		providerId: "json",
 		views: 467,
@@ -676,7 +493,6 @@ const sampleItems: QuestionsExplorerItem[] = [
 		id: "gt-042",
 		question: "Explain deep learning",
 		answer: "Deep learning uses neural networks with multiple layers...",
-		references: [],
 		status: "draft",
 		providerId: "json",
 		views: 589,
@@ -689,7 +505,6 @@ const sampleItems: QuestionsExplorerItem[] = [
 		question: "What is bioengineering?",
 		answer:
 			"Bioengineering applies engineering principles to biological systems...",
-		references: [],
 		status: "approved",
 		providerId: "json",
 		views: 321,
@@ -702,7 +517,6 @@ const sampleItems: QuestionsExplorerItem[] = [
 		question: "How do superconductors work?",
 		answer:
 			"Superconductors conduct electricity with zero resistance at low temperatures...",
-		references: [],
 		status: "draft",
 		providerId: "json",
 		views: 245,
@@ -714,7 +528,6 @@ const sampleItems: QuestionsExplorerItem[] = [
 		id: "gt-045",
 		question: "What is augmented reality?",
 		answer: "AR overlays digital information onto the real world...",
-		references: [],
 		status: "approved",
 		providerId: "json",
 		views: 678,
@@ -726,7 +539,6 @@ const sampleItems: QuestionsExplorerItem[] = [
 		id: "gt-046",
 		question: "Explain the carbon cycle",
 		answer: "The carbon cycle describes how carbon moves through ecosystems...",
-		references: [],
 		status: "approved",
 		providerId: "json",
 		views: 334,
@@ -739,7 +551,6 @@ const sampleItems: QuestionsExplorerItem[] = [
 		question: "What is quantum computing used for?",
 		answer:
 			"Quantum computers solve complex problems beyond classical computers...",
-		references: [],
 		status: "draft",
 		providerId: "json",
 		views: 501,
@@ -752,7 +563,6 @@ const sampleItems: QuestionsExplorerItem[] = [
 		question: "How does protein synthesis work?",
 		answer:
 			"Protein synthesis involves transcription and translation of genetic code...",
-		references: [],
 		status: "approved",
 		providerId: "json",
 		views: 287,
@@ -764,7 +574,6 @@ const sampleItems: QuestionsExplorerItem[] = [
 		id: "gt-049",
 		question: "What is edge computing?",
 		answer: "Edge computing processes data closer to where it's generated...",
-		references: [],
 		status: "draft",
 		providerId: "json",
 		views: 423,
@@ -777,7 +586,6 @@ const sampleItems: QuestionsExplorerItem[] = [
 		question: "Explain the Doppler effect",
 		answer:
 			"The Doppler effect is the change in frequency due to relative motion...",
-		references: [],
 		status: "approved",
 		providerId: "json",
 		views: 198,
diff --git a/frontend/src/components/app/QuestionsExplorer.tsx b/frontend/src/components/app/QuestionsExplorer.tsx
index 0ceb9f9..a188e4b 100644
--- a/frontend/src/components/app/QuestionsExplorer.tsx
+++ b/frontend/src/components/app/QuestionsExplorer.tsx
@@ -1,11 +1,13 @@
 import { Lock } from "lucide-react";
 import { useEffect, useId, useMemo, useRef, useState } from "react";
+import useTags from "../../hooks/useTags";
 import type { GroundTruthItem } from "../../models/groundTruth";
+import { getLastAgentTurn, getQueuePreview } from "../../models/groundTruth";
 import { cn } from "../../models/utils";
+import { getExplorerExtensions } from "../../registry/ExplorerExtensions";
 import { fetchAvailableDatasets } from "../../services/datasets";
 import type { GroundTruthListPagination } from "../../services/groundTruths";
 import { listAllGroundTruths } from "../../services/groundTruths";
-import { fetchTagsWithComputed } from "../../services/tags";
 import type {
 	FilterState,
 	FilterType,
@@ -33,6 +35,19 @@ export interface QuestionsExplorerItem extends GroundTruthItem {
 	reviewedAt?: string | null;
 }
 
+// Module-level constant: stable reference so effects that read defaultFilter
+// do not need to list it as a dependency (it never changes).
+const defaultFilter: FilterState = {
+	status: "all",
+	dataset: "all",
+	tags: { include: [], exclude: [] },
+	itemId: "",
+	refUrl: "",
+	keyword: "",
+	sortColumn: null,
+	sortDirection: "desc",
+};
+
 interface QuestionsExplorerProps {
 	items?: QuestionsExplorerItem[];
 	onAssign: (item: QuestionsExplorerItem) => void | Promise<void>;
@@ -49,6 +64,9 @@ export default function QuestionsExplorer({
 	className,
 }: QuestionsExplorerProps) {
 	const datasetFilterId = useId();
+	const itemIdFilterId = useId();
+	const referenceUrlFilterId = useId();
+	const keywordFilterId = useId();
 	const itemsPerPageId = useId();
 
 	// Use a ref to track the previous filter state to detect when filters change
@@ -56,18 +74,7 @@ export default function QuestionsExplorer({
 
 	// Flag to track whether URL has been synchronized (prevent infinite loops)
 	const urlSyncedRef = useRef(false);
-
-	// Default filter state
-	const defaultFilter: FilterState = {
-		status: "all",
-		dataset: "all",
-		tags: { include: [], exclude: [] },
-		itemId: "",
-		refUrl: "",
-		keyword: "",
-		sortColumn: null,
-		sortDirection: "desc",
-	};
+	const listRequestIdRef = useRef(0);
 
 	// Initialize filter state from URL parameters
 	const initializeFilterStateFromUrl = (): FilterState => {
@@ -123,8 +130,7 @@ export default function QuestionsExplorer({
 	>(undefined);
 	const [isLoading, setIsLoading] = useState(false);
 	const [loadError, setLoadError] = useState<string | null>(null);
-	const [manualTags, setManualTags] = useState<string[]>([]);
-	const [computedTags, setComputedTags] = useState<string[]>([]);
+	const { manualTags, computedTags } = useTags();
 	const [availableDatasets, setAvailableDatasets] = useState<string[]>([]);
 	const [expandedTagRows, setExpandedTagRows] = useState<Set<string>>(
 		new Set(),
@@ -161,24 +167,23 @@ export default function QuestionsExplorer({
 		sortDirection,
 	]);
 
-	// Fetch available tags and datasets from backend
+	// Fetch available datasets from the backend. Tag metadata now comes from the
+	// shared useTags() service-backed cache, so the explorer does not duplicate tag reads.
 	useEffect(() => {
-		let cancelled = false;
+		const controller = new AbortController();
 
-		Promise.all([fetchTagsWithComputed(), fetchAvailableDatasets()])
-			.then(([tagsResult, datasets]) => {
-				if (cancelled) return;
-				setManualTags(tagsResult.manualTags);
-				setComputedTags(tagsResult.computedTags);
+		fetchAvailableDatasets(false, controller.signal)
+			.then((datasets) => {
+				if (controller.signal.aborted) return;
 				setAvailableDatasets(datasets);
 			})
 			.catch((error) => {
-				if (cancelled) return;
-				console.error("Failed to fetch tags or datasets:", error);
+				if (controller.signal.aborted) return;
+				console.error("Failed to fetch datasets:", error);
 			});
 
 		return () => {
-			cancelled = true;
+			controller.abort();
 		};
 	}, []);
 
@@ -204,6 +209,9 @@ export default function QuestionsExplorer({
 			const filterChanged =
 				prev.status !== appliedFilter.status ||
 				prev.dataset !== appliedFilter.dataset ||
+				prev.itemId !== appliedFilter.itemId ||
+				prev.refUrl !== appliedFilter.refUrl ||
+				prev.keyword !== appliedFilter.keyword ||
 				prev.sortColumn !== appliedFilter.sortColumn ||
 				prev.sortDirection !== appliedFilter.sortDirection ||
 				JSON.stringify(prev.tags) !== JSON.stringify(appliedFilter.tags);
@@ -225,18 +233,23 @@ export default function QuestionsExplorer({
 			return;
 		}
 
-		let cancelled = false;
+		const controller = new AbortController();
+		const requestId = ++listRequestIdRef.current;
 		setIsLoading(true);
+		setLoadError(null);
 		// Clear previous items when starting a new fetch to avoid showing stale data
 		setFetchedItems([]);
 
 		// Build API parameters from applied filters
+		// Note: toolCallCount is a client-side sort only (not passed to API)
 		const sortByParam =
 			appliedFilter.sortColumn === "refs"
 				? "totalReferences"
 				: appliedFilter.sortColumn === "tagCount"
 					? "tagCount"
-					: appliedFilter.sortColumn;
+					: appliedFilter.sortColumn === "toolCallCount"
+						? null // client-side sort; do not pass to backend
+						: appliedFilter.sortColumn;
 
 		// Ensure page is at least 1
 		const safePage = Math.max(1, currentPage);
@@ -262,15 +275,25 @@ export default function QuestionsExplorer({
 			limit: itemsPerPage,
 		};
 
-		listAllGroundTruths(params)
+		listAllGroundTruths(params, controller.signal)
 			.then(({ items: loadedItems, pagination: paginationData }) => {
-				if (cancelled) return;
+				if (
+					controller.signal.aborted ||
+					requestId !== listRequestIdRef.current
+				) {
+					return;
+				}
 				setFetchedItems(loadedItems);
 				setPagination(paginationData);
 				setLoadError(null);
 			})
 			.catch((error) => {
-				if (cancelled) return;
+				if (
+					controller.signal.aborted ||
+					requestId !== listRequestIdRef.current
+				) {
+					return;
+				}
 				const message =
 					error instanceof Error
 						? error.message
@@ -278,12 +301,17 @@ export default function QuestionsExplorer({
 				setLoadError(message);
 			})
 			.finally(() => {
-				if (cancelled) return;
+				if (
+					controller.signal.aborted ||
+					requestId !== listRequestIdRef.current
+				) {
+					return;
+				}
 				setIsLoading(false);
 			});
 
 		return () => {
-			cancelled = true;
+			controller.abort();
 		};
 	}, [items, appliedFilter, currentPage, itemsPerPage]);
 
@@ -291,9 +319,20 @@ export default function QuestionsExplorer({
 	const totalItemsCount = pagination?.total ?? sourceItems.length;
 
 	const displayItems = useMemo(() => {
-		// Server handles all sorting now, no client-side sorting needed
+		// Client-side sort for toolCallCount (not supported by backend API)
+		if (appliedFilter.sortColumn === "toolCallCount") {
+			const sorted = [...sourceItems].sort((a, b) => {
+				const countA = a.toolCalls?.length ?? 0;
+				const countB = b.toolCalls?.length ?? 0;
+				return appliedFilter.sortDirection === "desc"
+					? countB - countA
+					: countA - countB;
+			});
+			return sorted;
+		}
+		// Server handles all other sorting
 		return sourceItems;
-	}, [sourceItems]);
+	}, [sourceItems, appliedFilter.sortColumn, appliedFilter.sortDirection]);
 
 	const handleFilterChange = (filter: FilterType) => {
 		setActiveFilter(filter);
@@ -343,12 +382,12 @@ export default function QuestionsExplorer({
 			sortDirection,
 		};
 
+		setCurrentPage(1);
 		setAppliedFilter(newFilter);
-		// Page reset is handled by useEffect that watches appliedFilter
 	};
 
 	const handleSort = (
-		column: "refs" | "reviewedAt" | "hasAnswer" | "tagCount",
+		column: "refs" | "reviewedAt" | "hasAnswer" | "tagCount" | "toolCallCount",
 	) => {
 		if (sortColumn === column) {
 			// If already sorting by this column, toggle direction
@@ -453,7 +492,7 @@ export default function QuestionsExplorer({
 					<div className="min-w-0">
 						<div className="flex items-center gap-2 mb-2">
 							<label
-								htmlFor="itemIdFilter"
+								htmlFor={itemIdFilterId}
 								className="text-base font-semibold text-slate-800"
 							>
 								Item ID:
@@ -526,7 +565,7 @@ export default function QuestionsExplorer({
 						</div>
 						<div className="flex flex-wrap items-center gap-2">
 							<input
-								id={useId()}
+								id={itemIdFilterId}
 								type="text"
 								value={itemIdFilter}
 								onChange={(e) => setItemIdFilter(e.target.value)}
@@ -550,7 +589,7 @@ export default function QuestionsExplorer({
 					<div className="min-w-0">
 						<div className="flex items-center gap-2 mb-2">
 							<label
-								htmlFor="referenceUrlFilter"
+								htmlFor={referenceUrlFilterId}
 								className="text-base font-semibold text-slate-800"
 							>
 								Reference URL:
@@ -607,7 +646,7 @@ export default function QuestionsExplorer({
 						</div>
 						<div className="flex flex-wrap items-center gap-2">
 							<input
-								id={useId()}
+								id={referenceUrlFilterId}
 								type="text"
 								value={referenceUrlFilter}
 								onChange={(e) => setReferenceUrlFilter(e.target.value)}
@@ -632,7 +671,7 @@ export default function QuestionsExplorer({
 					<div className="min-w-0">
 						<div className="flex items-center gap-2 mb-2">
 							<label
-								htmlFor="keywordFilter"
+								htmlFor={keywordFilterId}
 								className="text-base font-semibold text-slate-800"
 							>
 								Keyword Search:
@@ -640,7 +679,7 @@ export default function QuestionsExplorer({
 						</div>
 						<div className="flex flex-wrap items-center gap-2">
 							<input
-								id={useId()}
+								id={keywordFilterId}
 								type="text"
 								value={keywordFilter}
 								onChange={(e) => setKeywordFilter(e.target.value)}
@@ -905,7 +944,13 @@ export default function QuestionsExplorer({
 									</th>
 									<th className="px-3 py-3 text-left min-w-[60px]">Status</th>
 									<th className="px-3 py-3 text-left min-w-[160px] sm:min-w-[200px]">
-										Question
+										Question / Message
+									</th>
+									<th
+										className="px-3 py-3 text-center min-w-[50px] hidden lg:table-cell"
+										title="Number of conversation turns"
+									>
+										Turns
 									</th>
 									<th className="px-3 py-3 text-left min-w-[120px] hidden lg:table-cell">
 										Tags
@@ -996,6 +1041,41 @@ export default function QuestionsExplorer({
 												)}
 										</button>
 									</th>
+									{/* Tool Calls column – generic evidence indicator */}
+									<th className="px-3 py-3 text-center min-w-[60px] hidden xl:table-cell">
+										<button
+											type="button"
+											onClick={() => handleSort("toolCallCount")}
+											className="inline-flex items-center gap-1 transition-colors hover:text-violet-700 w-full justify-center"
+											aria-label="Sort by Tool Call Count"
+											title="Number of tool calls captured in the trace"
+										>
+											Tools
+											{appliedFilter.sortColumn === "toolCallCount" && (
+												<span className="text-violet-600">
+													{appliedFilter.sortDirection === "desc" ? "↓" : "↑"}
+												</span>
+											)}
+											{sortColumn === "toolCallCount" &&
+												sortColumn !== appliedFilter.sortColumn && (
+													<span className="text-amber-500 opacity-50">
+														{sortDirection === "desc" ? "↓" : "↑"}
+													</span>
+												)}
+										</button>
+									</th>
+									{/* Plugin-contributed columns (headers) */}
+									{getExplorerExtensions().flatMap((ext) =>
+										(ext.columns ?? []).map((col) => (
+											<th
+												key={col.key}
+												className="px-3 py-3 text-center hidden xl:table-cell"
+												style={col.width ? { minWidth: col.width } : undefined}
+											>
+												{col.header}
+											</th>
+										)),
+									)}
 									<th className="px-3 py-3 text-right min-w-[140px] sm:min-w-[180px] lg:min-w-[240px]">
 										Actions
 									</th>
@@ -1005,7 +1085,7 @@ export default function QuestionsExplorer({
 								{loadError ? (
 									<tr>
 										<td
-											colSpan={9}
+											colSpan={11}
 											className="px-4 py-8 text-center text-sm text-rose-600"
 										>
 											Failed to load items: {loadError}
@@ -1014,7 +1094,7 @@ export default function QuestionsExplorer({
 								) : isLoading && sourceItems.length === 0 ? (
 									<tr>
 										<td
-											colSpan={9}
+											colSpan={11}
 											className="px-4 py-8 text-center text-sm text-slate-500"
 										>
 											Loading ground truths…
@@ -1023,7 +1103,7 @@ export default function QuestionsExplorer({
 								) : displayItems.length === 0 ? (
 									<tr>
 										<td
-											colSpan={9}
+											colSpan={11}
 											className="px-4 py-8 text-center text-sm text-slate-500"
 										>
 											No items to display
@@ -1069,15 +1149,19 @@ export default function QuestionsExplorer({
 													)}
 												</div>
 											</td>
-											{/* Question */}
+											{/* Question / first user message */}
 											<td className="px-3 py-3 text-sm">
 												<div
 													className="truncate font-medium text-slate-800 max-w-[180px] sm:max-w-[240px] lg:max-w-[300px]"
-													title={item.question}
+													title={getQueuePreview(item)}
 												>
-													{item.question || "(no question)"}
+													{getQueuePreview(item)}
 												</div>
 											</td>
+											{/* Turns count */}
+											<td className="px-3 py-3 text-xs text-center hidden lg:table-cell text-slate-500">
+												{item.history?.length ? item.history.length : "—"}
+											</td>
 											{/* Tags */}
 											<td className="px-3 py-3 hidden lg:table-cell">
 												{(item.manualTags && item.manualTags.length > 0) ||
@@ -1140,7 +1224,7 @@ export default function QuestionsExplorer({
 											</td>
 											{/* Has Answer */}
 											<td className="px-3 py-3 text-center text-sm font-medium hidden xl:table-cell">
-												{item.answer && item.answer.trim().length > 0 ? (
+												{getLastAgentTurn(item).trim().length > 0 ? (
 													<span className="text-emerald-700">Yes</span>
 												) : (
 													<span className="text-slate-400">No</span>
@@ -1166,6 +1250,44 @@ export default function QuestionsExplorer({
 														)
 													: "-"}
 											</td>
+											{/* Tool Calls (generic evidence indicator) */}
+											<td className="px-3 py-3 text-center text-sm font-medium text-slate-700 hidden xl:table-cell">
+												{(item.toolCalls?.length ?? 0) > 0 ? (
+													<span
+														className="rounded-full bg-violet-100 px-2 py-0.5 text-xs font-medium text-violet-800"
+														title={`${item.toolCalls?.length} tool call(s) captured`}
+													>
+														{item.toolCalls?.length}
+													</span>
+												) : item.expectedTools ? (
+													<span
+														className="text-xs text-amber-600"
+														title="Expected tools defined but no tool calls recorded"
+													>
+														⚠
+													</span>
+												) : (
+													<span className="text-xs text-slate-400">—</span>
+												)}
+											</td>
+											{/* Plugin-contributed columns (cells) */}
+											{getExplorerExtensions().flatMap((ext) =>
+												(ext.columns ?? []).map((col) => (
+													<td
+														key={col.key}
+														className="px-3 py-3 text-center text-sm font-medium text-slate-700 hidden xl:table-cell"
+														style={
+															col.width ? { minWidth: col.width } : undefined
+														}
+													>
+														{col.cellRenderer ? (
+															<col.cellRenderer item={item} />
+														) : (
+															(col.getValue(item) ?? "—")
+														)}
+													</td>
+												)),
+											)}
 											{/* Actions */}
 											<td className="px-3 py-3">
 												<div className="flex flex-wrap items-center justify-end gap-2">
diff --git a/frontend/src/components/app/QueueSidebar.tsx b/frontend/src/components/app/QueueSidebar.tsx
index 835236c..e9cf408 100644
--- a/frontend/src/components/app/QueueSidebar.tsx
+++ b/frontend/src/components/app/QueueSidebar.tsx
@@ -1,6 +1,7 @@
 import { CircleAlert, Clipboard, RefreshCw } from "lucide-react";
 import { useEffect, useId, useMemo, useRef, useState } from "react";
 import type { GroundTruthItem } from "../../models/groundTruth";
+import { getQueuePreview } from "../../models/groundTruth";
 import { cn } from "../../models/utils";
 
 type Props = {
@@ -197,9 +198,9 @@ export default function QueueSidebar({
 							</div>
 							<div
 								className="truncate text-xs text-slate-600"
-								title={it.question}
+								title={getQueuePreview(it)}
 							>
-								{it.question}
+								{getQueuePreview(it)}
 							</div>
 						</div>
 					))}
diff --git a/frontend/src/components/app/ReferencesPanel/ReferencesTabs.tsx b/frontend/src/components/app/ReferencesPanel/ReferencesTabs.tsx
index 1071a29..0311c16 100644
--- a/frontend/src/components/app/ReferencesPanel/ReferencesTabs.tsx
+++ b/frontend/src/components/app/ReferencesPanel/ReferencesTabs.tsx
@@ -51,10 +51,12 @@ export default function ReferencesTabs({
 	isMultiTurn = false,
 	readOnly = false,
 }: Props) {
-	// Phase 2: In multi-turn mode, disable search tab and force 'selected' view
-	if (isMultiTurn && rightTab !== "selected") {
-		setRightTab("selected");
-	}
+	useEffect(() => {
+		if (isMultiTurn && rightTab !== "selected") {
+			setRightTab("selected");
+		}
+	}, [isMultiTurn, rightTab, setRightTab]);
+
 	useEffect(() => {
 		function onKeyDown(e: KeyboardEvent) {
 			const target = e.target as HTMLElement | null;
@@ -63,7 +65,7 @@ export default function ReferencesTabs({
 				return;
 			const isMod = e.metaKey || e.ctrlKey;
 			if (!isMod) return;
-			if (e.key === "1") {
+			if (e.key === "1" && !isMultiTurn) {
 				e.preventDefault();
 				setRightTab("search");
 			} else if (e.key === "2") {
@@ -73,7 +75,7 @@ export default function ReferencesTabs({
 		}
 		window.addEventListener("keydown", onKeyDown);
 		return () => window.removeEventListener("keydown", onKeyDown);
-	}, [setRightTab]);
+	}, [isMultiTurn, setRightTab]);
 	return (
 		<aside
 			className={cn(
diff --git a/frontend/src/components/app/ReferencesPanel/SelectedTab.tsx b/frontend/src/components/app/ReferencesPanel/SelectedTab.tsx
index b83389f..78ba334 100644
--- a/frontend/src/components/app/ReferencesPanel/SelectedTab.tsx
+++ b/frontend/src/components/app/ReferencesPanel/SelectedTab.tsx
@@ -1,7 +1,8 @@
 import { ExternalLink, Trash2 } from "lucide-react";
 import type { Reference } from "../../../models/groundTruth";
 import { normalizeUrl, urlToTitle } from "../../../models/utils";
-import { getCachedConfig } from "../../../services/runtimeConfig";
+import { getReferenceApprovalRequirements } from "../../../models/validators";
+import { useRuntimeConfig } from "../../../services/runtimeConfig";
 
 type Props = {
 	references: Reference[];
@@ -18,9 +19,11 @@ export default function SelectedTab({
 	onOpenReference,
 	readOnly = false,
 }: Props) {
-	const config = getCachedConfig();
-	const requireVisit = config?.requireReferenceVisit ?? true;
-	const requireKeyPara = config?.requireKeyParagraph ?? false;
+	const runtimeConfig = useRuntimeConfig();
+	const {
+		requireReferenceVisit: requireVisit,
+		requireKeyParagraph: requireKeyPara,
+	} = getReferenceApprovalRequirements(runtimeConfig);
 
 	return (
 		<div className="flex h-full flex-col p-4">
@@ -172,7 +175,8 @@ export default function SelectedTab({
 				})}
 				{references.length === 0 && (
 					<div className="rounded-lg border bg-white p-3 text-xs text-slate-600">
-						No references added yet. Use the Search tab to add references.
+						No evidence attached yet. Add sources from the search surface or
+						from a plugin-owned workflow panel.
 					</div>
 				)}
 			</div>
diff --git a/frontend/src/components/app/StatsView.tsx b/frontend/src/components/app/StatsView.tsx
index 7a566b8..0a7cea1 100644
--- a/frontend/src/components/app/StatsView.tsx
+++ b/frontend/src/components/app/StatsView.tsx
@@ -1,4 +1,5 @@
 import type { GroundTruthItem } from "../../models/groundTruth";
+import { hasEvidenceData } from "../../models/groundTruth";
 
 type SprintStats = {
 	sprint: string;
@@ -7,9 +8,17 @@ type SprintStats = {
 	deleted: number;
 };
 
+/**
+ * Generic plugin-contributed metrics that can be extended by plugin packs.
+ * Each key is a metric name and each value is a displayable number.
+ */
+export type PluginMetrics = Record<string, number>;
+
 export type StatsPayload = {
 	total: { approved: number; draft: number; deleted: number };
 	perSprint: SprintStats[];
+	/** Optional plugin-contributed metrics to display alongside core stats. */
+	pluginMetrics?: PluginMetrics;
 };
 
 export default function StatsView({
@@ -30,6 +39,33 @@ export default function StatsView({
 		{ approved: 0, draft: 0, deleted: 0 },
 	);
 
+	// Compute generic agentic metrics from current items
+	const agenticMetrics = items.reduce(
+		(acc, it) => {
+			if (it.toolCalls && it.toolCalls.length > 0) {
+				acc.itemsWithToolCalls++;
+				acc.totalToolCalls += it.toolCalls.length;
+			}
+			if (it.contextEntries && it.contextEntries.length > 0) {
+				acc.itemsWithContext++;
+			}
+			if (hasEvidenceData(it)) {
+				acc.itemsWithEvidence++;
+			}
+			if (it.expectedTools) {
+				acc.itemsWithExpectedTools++;
+			}
+			return acc;
+		},
+		{
+			itemsWithToolCalls: 0,
+			totalToolCalls: 0,
+			itemsWithContext: 0,
+			itemsWithEvidence: 0,
+			itemsWithExpectedTools: 0,
+		},
+	);
+
 	return (
 		<section className="rounded-2xl border bg-white p-4 shadow-sm">
 			<div className="mb-4">
@@ -90,6 +126,67 @@ export default function StatsView({
 					))}
 				</div>
 			</div>
+
+			{/* Generic agentic evidence metrics – derived from current items */}
+			{items.length > 0 && (
+				<div className="mt-6">
+					<div className="mb-2 text-sm font-medium">
+						Generic Evidence Metrics
+					</div>
+					<p className="mb-3 text-xs text-slate-500">
+						Computed from {items.length} loaded item
+						{items.length !== 1 ? "s" : ""}.
+					</p>
+					<div className="grid grid-cols-2 gap-3 sm:grid-cols-4">
+						<div className="rounded-xl border bg-slate-50 p-3">
+							<div className="text-xs text-slate-500">Items w/ Trace</div>
+							<div className="mt-1 text-xl font-bold text-slate-800">
+								{agenticMetrics.itemsWithEvidence}
+							</div>
+						</div>
+						<div className="rounded-xl border bg-violet-50 p-3">
+							<div className="text-xs text-slate-500">Total Tool Calls</div>
+							<div className="mt-1 text-xl font-bold text-violet-800">
+								{agenticMetrics.totalToolCalls}
+							</div>
+							<div className="mt-0.5 text-xs text-slate-400">
+								{agenticMetrics.itemsWithToolCalls} items
+							</div>
+						</div>
+						<div className="rounded-xl border bg-blue-50 p-3">
+							<div className="text-xs text-slate-500">Items w/ Context</div>
+							<div className="mt-1 text-xl font-bold text-blue-800">
+								{agenticMetrics.itemsWithContext}
+							</div>
+						</div>
+						<div className="rounded-xl border bg-amber-50 p-3">
+							<div className="text-xs text-slate-500">
+								Items w/ Expected Tools
+							</div>
+							<div className="mt-1 text-xl font-bold text-amber-800">
+								{agenticMetrics.itemsWithExpectedTools}
+							</div>
+						</div>
+					</div>
+				</div>
+			)}
+
+			{/* Plugin-contributed metrics (extensible per plugin pack) */}
+			{data.pluginMetrics && Object.keys(data.pluginMetrics).length > 0 && (
+				<div className="mt-6">
+					<div className="mb-2 text-sm font-medium">Plugin Metrics</div>
+					<div className="grid grid-cols-2 gap-3 sm:grid-cols-4">
+						{Object.entries(data.pluginMetrics).map(([key, value]) => (
+							<div key={key} className="rounded-xl border bg-slate-50 p-3">
+								<div className="text-xs text-slate-500">{key}</div>
+								<div className="mt-1 text-xl font-bold text-slate-800">
+									{value}
+								</div>
+							</div>
+						))}
+					</div>
+				</div>
+			)}
 		</section>
 	);
 }
diff --git a/frontend/src/components/app/TracePanel.tsx b/frontend/src/components/app/TracePanel.tsx
new file mode 100644
index 0000000..ae2235c
--- /dev/null
+++ b/frontend/src/components/app/TracePanel.tsx
@@ -0,0 +1,564 @@
+import { useState } from "react";
+import type {
+	ContextEntry,
+	ExpectedTools,
+	FeedbackEntry,
+	GroundTruthItem,
+	PluginPayload,
+	Reference,
+	ToolCallRecord,
+} from "../../models/groundTruth";
+import { getItemReferences, hasEvidenceData } from "../../models/groundTruth";
+import { cn } from "../../models/utils";
+import { fieldComponentRegistry } from "../../registry/FieldComponentRegistry";
+import { RegistryRenderer } from "../../registry/RegistryRenderer";
+import type { EditorProps, ViewerProps } from "../../registry/types";
+import ContextEntryEditor from "./editors/ContextEntryEditor";
+import ToolCallDetailView from "./editors/ToolCallDetailView";
+
+/** Extract executionTimeSeconds from a tool call's response (typed as unknown); returns null if absent or non-numeric. */
+function getExecTime(response: unknown): number | null {
+	if (response && typeof response === "object") {
+		const r = response as Record<string, unknown>;
+		if (typeof r.executionTimeSeconds === "number") {
+			return r.executionTimeSeconds;
+		}
+	}
+	return null;
+}
+
+/** Derive overall sentiment from numeric feedback values (lower = more positive; a mean <= 2.5 counts as positive). */
+function deriveSentiment(
+	feedback: FeedbackEntry[],
+): "positive" | "negative" | null {
+	const numericValues: number[] = [];
+	for (const f of feedback) {
+		for (const v of Object.values(f.values ?? {})) {
+			if (typeof v === "number") numericValues.push(v);
+		}
+	}
+	if (numericValues.length === 0) return null;
+	const avg = numericValues.reduce((a, b) => a + b, 0) / numericValues.length;
+	return avg <= 2.5 ? "positive" : "negative";
+}
+
+/** Score color: <= 1 green, <= 2 amber, anything higher red. */
+function scoreColor(score: number): string {
+	if (score <= 1) return "text-emerald-700";
+	if (score <= 2) return "text-amber-700";
+	return "text-rose-700";
+}
+
+function CollapsibleSection({
+	title,
+	badge,
+	defaultOpen = false,
+	children,
+}: {
+	title: string;
+	badge?: string | number;
+	defaultOpen?: boolean;
+	children: React.ReactNode;
+}) {
+	const [open, setOpen] = useState(defaultOpen);
+	return (
+		<div className="rounded-xl border border-slate-200 bg-white shadow-sm">
+			<button
+				type="button"
+				className="flex w-full items-center justify-between rounded-xl px-4 py-3 text-left hover:bg-slate-50 select-none"
+				onClick={() => setOpen((v) => !v)}
+				aria-expanded={open}
+			>
+				<span className="flex items-center gap-2 text-sm font-medium text-slate-700">
+					{title}
+					{badge !== undefined && (
+						<span className="rounded-full bg-slate-100 px-2 py-0.5 text-xs text-slate-600">
+							{badge}
+						</span>
+					)}
+				</span>
+				<span className="text-xs text-slate-400">{open ? "▾" : "▸"}</span>
+			</button>
+			{open && <div className="border-t px-4 pb-4 pt-3">{children}</div>}
+		</div>
+	);
+}
+
+function TraceInfoSection({ item }: { item: GroundTruthItem }) {
+	const traceIds = item.traceIds ?? {};
+	const contextEntries = item.contextEntries ?? [];
+	const tracePayload = item.tracePayload ?? {};
+
+	const entries: [string, string][] = [];
+	for (const [k, v] of Object.entries(traceIds)) {
+		const display =
+			typeof v === "string" && v.length > 20
+				? `${v.substring(0, 18)}…`
+				: String(v);
+		entries.push([k, display]);
+	}
+
+	const contextMap = new Map(contextEntries.map((e) => [e.key, e]));
+	if (contextMap.has("impacted_device_type")) {
+		const deviceType = String(
+			contextMap.get("impacted_device_type")?.value ?? "",
+		);
+		const device = contextMap.has("impacted_device")
+			? String(contextMap.get("impacted_device")?.value ?? "")
+			: "";
+		entries.push(["device", `${deviceType} ${device}`.trim()]);
+	}
+
+	const sentiment = deriveSentiment(item.feedback ?? []);
+	if (sentiment) {
+		entries.push(["feedback", sentiment === "positive" ? "like" : "dislike"]);
+	}
+
+	const resolution =
+		tracePayload.resolution ?? contextMap.get("resolution")?.value;
+	if (resolution) {
+		entries.push(["resolution", String(resolution)]);
+	}
+
+	if (entries.length === 0) return null;
+
+	const wideKeys = new Set(["resolution"]);
+	const normalEntries = entries.filter(([k]) => !wideKeys.has(k));
+	const wideEntries = entries.filter(([k]) => wideKeys.has(k));
+
+	return (
+		<div className="space-y-1 rounded-xl border border-slate-200 bg-slate-50 p-3">
+			<div className="mb-1 text-xs font-semibold uppercase tracking-wide text-slate-600">
+				Trace Info
+			</div>
+			<div className="grid grid-cols-2 gap-x-4 gap-y-1 text-xs">
+				{normalEntries.map(([k, v]) => (
+					<div key={k}>
+						<span className="font-mono text-slate-500">{k}: </span>
+						<span className="font-mono text-slate-700">{v}</span>
+					</div>
+				))}
+				{wideEntries.map(([k, v]) => (
+					<div key={k} className="col-span-2">
+						<span className="font-mono text-slate-500">{k}: </span>
+						<span className="text-slate-700">{v}</span>
+					</div>
+				))}
+			</div>
+		</div>
+	);
+}
+
+type ToolCallRenderData = {
+	toolCall: ToolCallRecord;
+	index: number;
+	item: GroundTruthItem;
+	expectedTools?: ExpectedTools;
+	references: Reference[];
+	onAddReferences?: (refs: Reference[]) => void;
+	onOpenReference?: (ref: Reference) => void;
+	onUpdateReference?: (refId: string, partial: Partial<Reference>) => void;
+	onRemoveReference?: (refId: string) => void;
+};
+
+type PluginRenderData = {
+	slot: string;
+	payload: PluginPayload;
+};
+
+function ToolCallViewer({ data }: ViewerProps) {
+	const value = data as ToolCallRenderData;
+	return (
+		<ToolCallDetailView
+			tc={value.toolCall}
+			index={value.index}
+			item={value.item}
+			expectedTools={value.expectedTools}
+			references={value.references}
+			onAddReferences={value.onAddReferences}
+			onOpenReference={value.onOpenReference}
+			onUpdateReference={value.onUpdateReference}
+			onRemoveReference={value.onRemoveReference}
+		/>
+	);
+}
+
+function ToolCallEditor({ data, onChange }: EditorProps) {
+	const value = data as ToolCallRenderData;
+	return (
+		<ToolCallDetailView
+			tc={value.toolCall}
+			index={value.index}
+			item={value.item}
+			expectedTools={value.expectedTools}
+			references={value.references}
+			onAddReferences={value.onAddReferences}
+			onOpenReference={value.onOpenReference}
+			onUpdateReference={value.onUpdateReference}
+			onRemoveReference={value.onRemoveReference}
+			onUpdateExpectedTools={(expectedTools) =>
+				onChange({ ...value, expectedTools })
+			}
+		/>
+	);
+}
+
+function ContextEntriesViewer({ data }: ViewerProps) {
+	const entries = (data as ContextEntry[]) ?? [];
+	if (!entries.length) {
+		return (
+			<div className="text-xs italic text-slate-400">
+				No context entries provided.
+			</div>
+		);
+	}
+	return (
+		<div className="space-y-2">
+			{entries.map((entry) => (
+				<div
+					key={`${entry.key}-${JSON.stringify(entry.value)}`}
+					className="rounded-lg border border-slate-200 bg-slate-50 p-3"
+				>
+					<div className="flex items-start gap-2 text-xs">
+						<span className="shrink-0 font-mono text-slate-500">
+							{entry.key}:
+						</span>
+						<span className="break-all text-slate-700">
+							{typeof entry.value === "object"
+								? JSON.stringify(entry.value)
+								: String(entry.value)}
+						</span>
+					</div>
+				</div>
+			))}
+		</div>
+	);
+}
+
+function ContextEntriesEditor({ data, onChange }: EditorProps) {
+	return (
+		<ContextEntryEditor
+			entries={(data as ContextEntry[]) ?? []}
+			onUpdate={(entries) => onChange(entries)}
+		/>
+	);
+}
+
+function FeedbackViewer({ data }: ViewerProps) {
+	const feedback = (data as FeedbackEntry[]) ?? [];
+	const allValues: [string, number][] = [];
+	for (const entry of feedback) {
+		for (const [k, v] of Object.entries(entry.values ?? {})) {
+			if (typeof v === "number") allValues.push([k, v]);
+		}
+	}
+	if (!allValues.length) return null;
+	return (
+		<div className="rounded-xl border border-slate-200 bg-slate-50 p-3">
+			<div className="mb-1 text-xs font-semibold uppercase tracking-wide text-slate-600">
+				Feedback Scores
+			</div>
+			{allValues.map(([question, score]) => (
+				<div
+					key={question}
+					className="flex items-center justify-between py-0.5 text-xs"
+				>
+					<span className="mr-2 text-slate-600">{question}</span>
+					<span className={cn("font-medium", scoreColor(score))}>{score}</span>
+				</div>
+			))}
+			<p className="mt-1 text-xs italic text-slate-400">
+				Scale: 1 = Strongly Agree, 5 = Strongly Disagree
+			</p>
+		</div>
+	);
+}
+
+function PluginPayloadViewer({ data }: ViewerProps) {
+	const { slot, payload } = data as PluginRenderData;
+	const hasData = Object.keys(payload.data ?? {}).length > 0;
+	return (
+		<div className="space-y-2 rounded-lg border border-slate-200 bg-slate-50 p-3">
+			<div className="flex flex-wrap items-center gap-2">
+				<span className="text-sm font-medium text-slate-800">{slot}</span>
+				<span className="rounded-full bg-violet-100 px-2 py-0.5 text-xs text-violet-800">
+					{payload.kind}
+				</span>
+				<span className="rounded-full bg-slate-200 px-2 py-0.5 text-xs text-slate-600">
+					v{payload.version}
+				</span>
+			</div>
+			{hasData ? (
+				<pre className="overflow-auto rounded-md bg-slate-100 p-2 text-xs text-slate-700">
+					{JSON.stringify(payload.data, null, 2)}
+				</pre>
+			) : (
+				<div className="text-xs italic text-slate-400">
+					No plugin-owned fields provided.
+				</div>
+			)}
+		</div>
+	);
+}
+
+function JsonBlockViewer({ data }: ViewerProps) {
+	return (
+		<pre className="max-h-64 overflow-auto rounded-md bg-slate-100 p-2 text-xs text-slate-700">
+			{JSON.stringify(data, null, 2)}
+		</pre>
+	);
+}
+
+fieldComponentRegistry.registerIfAbsent({
+	discriminator: "toolCall",
+	viewer: ToolCallViewer,
+	editor: ToolCallEditor,
+	displayName: "Tool Call",
+});
+fieldComponentRegistry.registerIfAbsent({
+	discriminator: "contextEntries",
+	viewer: ContextEntriesViewer,
+	editor: ContextEntriesEditor,
+	displayName: "Context Entries",
+});
+fieldComponentRegistry.registerIfAbsent({
+	discriminator: "feedback",
+	viewer: FeedbackViewer,
+	displayName: "Feedback Scores",
+});
+fieldComponentRegistry.registerIfAbsent({
+	discriminator: "pluginPayload",
+	viewer: PluginPayloadViewer,
+	displayName: "Plugin Payload",
+});
+fieldComponentRegistry.registerIfAbsent({
+	discriminator: "tracePayload",
+	viewer: JsonBlockViewer,
+	displayName: "Trace Payload",
+});
+fieldComponentRegistry.registerIfAbsent({
+	discriminator: "metadata",
+	viewer: JsonBlockViewer,
+	displayName: "Metadata",
+});
+
+export { getExecTime };
+
+export default function TracePanel({
+	item,
+	className,
+	onUpdateContextEntries,
+	onUpdateExpectedTools,
+	onAddReferences,
+	onOpenReference,
+	onUpdateReference,
+	onRemoveReference,
+}: {
+	item: GroundTruthItem;
+	className?: string;
+	onUpdateContextEntries?: (entries: ContextEntry[]) => void;
+	onUpdateExpectedTools?: (tools: ExpectedTools) => void;
+	onAddReferences?: (refs: Reference[]) => void;
+	onOpenReference?: (ref: Reference) => void;
+	onUpdateReference?: (refId: string, partial: Partial<Reference>) => void;
+	onRemoveReference?: (refId: string) => void;
+}) {
+	const [expanded, setExpanded] = useState(true);
+
+	if (!hasEvidenceData(item)) {
+		return (
+			<div
+				className={cn(
+					"rounded-2xl border border-slate-200 bg-slate-50 p-4 text-center text-sm text-slate-400",
+					className,
+				)}
+			>
+				No trace or evidence data available for this item.
+			</div>
+		);
+	}
+
+	const toolCalls = item.toolCalls ?? [];
+	const references = getItemReferences(item);
+	const contextEntries = item.contextEntries ?? [];
+	const metadata = item.metadata ?? {};
+	const plugins = item.plugins ?? {};
+	const feedback = item.feedback ?? [];
+	const tracePayload = item.tracePayload ?? {};
+	const sentiment = deriveSentiment(feedback);
+	const hasMoreDetails =
+		contextEntries.length > 0 ||
+		!!onUpdateContextEntries ||
+		Object.keys(metadata).length > 0 ||
+		Object.entries(plugins).length > 0 ||
+		Object.keys(tracePayload).length > 0;
+
+	return (
+		<div className={cn("rounded-2xl border bg-white shadow-sm", className)}>
+			<button
+				type="button"
+				className="flex w-full items-center justify-between rounded-2xl p-4 cursor-pointer hover:bg-slate-50 select-none"
+				onClick={() => setExpanded((v) => !v)}
+				aria-expanded={expanded}
+			>
+				<div className="flex flex-wrap items-center gap-2">
+					<span className="text-sm font-medium text-slate-700">
+						Evidence & Review ({toolCalls.length} tool call
+						{toolCalls.length !== 1 ? "s" : ""})
+					</span>
+					{sentiment && (
+						<span
+							className={cn(
+								"rounded-full px-2 py-0.5 text-xs font-medium",
+								sentiment === "positive"
+									? "bg-emerald-100 text-emerald-800"
+									: "bg-rose-100 text-rose-800",
+							)}
+						>
+							{sentiment === "positive" ? "👍 Positive" : "👎 Negative"}
+						</span>
+					)}
+				</div>
+				<span className="text-xs text-slate-500">
+					{expanded ? "▾ Collapse" : "▸ Expand"}
+				</span>
+			</button>
+
+			{expanded && (
+				<div className="space-y-3 border-t px-4 pb-4">
+					<div className="mt-3">
+						<TraceInfoSection item={item} />
+					</div>
+
+					<RegistryRenderer
+						discriminator="feedback:scores"
+						data={feedback}
+						context={{ itemId: item.id, fieldPath: "feedback", readOnly: true }}
+						mode="viewer"
+					/>
+
+					{toolCalls.length > 0 && (
+						<>
+							<div className="mt-2 text-xs font-semibold uppercase tracking-wide text-slate-600">
+								Tool Calls ({toolCalls.length})
+							</div>
+							{toolCalls.map((tc, i) => (
+								<RegistryRenderer
+									key={tc.id || String(i)}
+									discriminator={`toolCall:${tc.name}`}
+									data={{
+										toolCall: tc,
+										index: i,
+										item,
+										expectedTools: item.expectedTools,
+										references,
+										onAddReferences,
+										onOpenReference,
+										onUpdateReference,
+										onRemoveReference,
+									}}
+									context={{
+										itemId: item.id,
+										fieldPath: `toolCalls.${i}`,
+										readOnly: !onUpdateExpectedTools,
+									}}
+									mode={onUpdateExpectedTools ? "editor" : "viewer"}
+									onChange={(next) => {
+										if (!onUpdateExpectedTools) return;
+										onUpdateExpectedTools(
+											(next as ToolCallRenderData).expectedTools ?? {
+												required: [],
+											},
+										);
+									}}
+								/>
+							))}
+						</>
+					)}
+
+					{hasMoreDetails && (
+						<CollapsibleSection title="More Details">
+							<div className="space-y-3">
+								{(contextEntries.length > 0 || onUpdateContextEntries) && (
+									<CollapsibleSection
+										title="Context Entries"
+										badge={contextEntries.length}
+										defaultOpen
+									>
+										<RegistryRenderer
+											discriminator="contextEntries:batch"
+											data={contextEntries}
+											context={{
+												itemId: item.id,
+												fieldPath: "contextEntries",
+												readOnly: !onUpdateContextEntries,
+											}}
+											mode={onUpdateContextEntries ? "editor" : "viewer"}
+											onChange={(next) =>
+												onUpdateContextEntries?.(next as ContextEntry[])
+											}
+										/>
+									</CollapsibleSection>
+								)}
+
+								{Object.keys(metadata).length > 0 && (
+									<CollapsibleSection title="Metadata">
+										<RegistryRenderer
+											discriminator="metadata"
+											data={metadata}
+											context={{
+												itemId: item.id,
+												fieldPath: "metadata",
+												readOnly: true,
+											}}
+											mode="viewer"
+										/>
+									</CollapsibleSection>
+								)}
+
+								{Object.entries(plugins).length > 0 && (
+									<CollapsibleSection
+										title="Plugin Details"
+										badge={Object.keys(plugins).length}
+									>
+										<div className="space-y-2">
+											{Object.entries(plugins).map(([slot, payload]) => (
+												<RegistryRenderer
+													key={slot}
+													discriminator={`pluginPayload:${payload.kind}`}
+													data={{ slot, payload }}
+													context={{
+														itemId: item.id,
+														fieldPath: `plugins.${slot}`,
+														pluginKind: payload.kind,
+														readOnly: true,
+													}}
+													mode="viewer"
+												/>
+											))}
+										</div>
+									</CollapsibleSection>
+								)}
+
+								{Object.keys(tracePayload).length > 0 && (
+									<CollapsibleSection title="Trace Payload">
+										<RegistryRenderer
+											discriminator="tracePayload"
+											data={tracePayload}
+											context={{
+												itemId: item.id,
+												fieldPath: "tracePayload",
+												readOnly: true,
+											}}
+											mode="viewer"
+										/>
+									</CollapsibleSection>
+								)}
+							</div>
+						</CollapsibleSection>
+					)}
+				</div>
+			)}
+		</div>
+	);
+}
diff --git a/frontend/src/components/app/editor/ConversationTurn.tsx b/frontend/src/components/app/editor/ConversationTurn.tsx
index e569ab1..3bb70c8 100644
--- a/frontend/src/components/app/editor/ConversationTurn.tsx
+++ b/frontend/src/components/app/editor/ConversationTurn.tsx
@@ -1,35 +1,23 @@
 import {
-	Bot,
 	Check,
 	ChevronDown,
 	ChevronRight,
 	Edit2,
-	Loader2,
-	Paperclip,
 	Trash2,
 	X,
 } from "lucide-react";
 import { useEffect, useRef, useState } from "react";
-import type {
-	ConversationTurn,
-	ExpectedBehavior,
-} from "../../../models/groundTruth";
+import type { ConversationTurn } from "../../../models/groundTruth";
 import { cn } from "../../../models/utils";
 import MarkdownRenderer from "../../common/MarkdownRenderer";
-import ExpectedBehaviorSelector from "./ExpectedBehaviorSelector";
 
 type Props = {
 	turn: ConversationTurn;
 	index: number;
 	isLast: boolean;
 	onUpdate: (content: string) => void;
-	onUpdateExpectedBehavior?: (behaviors: ExpectedBehavior[]) => void; // only for agent turns
 	onDelete: () => void;
-	onRegenerate?: () => void; // only for agent turns - regenerates with full agent (tools + search)
 	canEdit?: boolean;
-	isGenerating?: boolean;
-	referenceCount?: number;
-	onViewReferences?: () => void;
 };
 
 export default function ConversationTurnComponent({
@@ -37,24 +25,18 @@ export default function ConversationTurnComponent({
 	index,
 	isLast,
 	onUpdate,
-	onUpdateExpectedBehavior,
 	onDelete,
-	onRegenerate,
 	canEdit = true,
-	isGenerating = false,
-	referenceCount = 0,
-	onViewReferences,
 }: Props) {
 	const [isEditing, setIsEditing] = useState(false);
 	const [isCollapsed, setIsCollapsed] = useState(false);
 	const [editContent, setEditContent] = useState(turn.content);
-	const controlsEnabled = canEdit && !isGenerating;
 
 	// Calculate pair-based turn number (each user+agent pair is one turn)
 	const turnNumber = Math.floor(index / 2) + 1;
 
 	const toggleEditMode = () => {
-		if (!controlsEnabled) return;
+		if (!canEdit) return;
 		if (isEditing) {
 			setEditContent(turn.content); // reset on cancel
 		}
@@ -64,19 +46,26 @@ export default function ConversationTurnComponent({
 	};
 
 	const handleSave = () => {
-		if (!controlsEnabled) return;
+		if (!canEdit) return;
 		onUpdate(editContent);
 		setIsEditing(false);
 	};
 
 	const handleCancel = () => {
-		if (!controlsEnabled) return;
+		if (!canEdit) return;
 		setEditContent(turn.content);
 		setIsEditing(false);
 	};
 
 	const isUser = turn.role === "user";
-	const isAgent = turn.role === "agent";
+	const isAgent = turn.role !== "user";
+	const stableTurnIdentity = turn.stepId || turn.turnId;
+	// Display label: "User" or "Agent" for the generic roles; any other role is shown verbatim
+	const roleLabel = isUser
+		? "User"
+		: turn.role === "agent"
+			? "Agent"
+			: turn.role;
 
 	// Sync editContent when turn.content changes (e.g., when switching queue items)
 	useEffect(() => {
@@ -99,6 +88,7 @@ export default function ConversationTurnComponent({
 
 	return (
 		<div
+			data-turn-id={turn.turnId}
 			data-turn-index={index}
 			data-last-turn={isLast ? "true" : "false"}
 			className={cn(
@@ -106,7 +96,6 @@ export default function ConversationTurnComponent({
 				isUser && "border-blue-200 bg-blue-50",
 				isAgent && "border-violet-200 bg-violet-50",
 			)}
-			aria-busy={isGenerating ? "true" : undefined}
 		>
 			<div className="mb-2 flex items-center justify-between">
 				<div className="flex items-center gap-2">
@@ -117,16 +106,19 @@ export default function ConversationTurnComponent({
 							isAgent && "bg-violet-500 text-white",
 						)}
 					>
-						{isUser ? "User" : "Agent"}
+						{roleLabel}
+					</span>
+					<span className="text-xs text-slate-600">
+						Turn #{turnNumber}
+						{stableTurnIdentity ? ` · ${stableTurnIdentity}` : ""}
 					</span>
-					<span className="text-xs text-slate-600">Turn #{turnNumber}</span>
 					<button
 						type="button"
 						onClick={() => setIsCollapsed((c) => !c)}
 						title={isCollapsed ? "Expand turn" : "Collapse turn"}
 						aria-expanded={!isCollapsed}
 						className="ml-1 flex items-center gap-1 rounded-lg border border-slate-200 px-2 py-1 text-xs font-medium text-slate-600 hover:bg-slate-50 disabled:cursor-not-allowed disabled:opacity-50"
-						disabled={!controlsEnabled || isEditing}
+						disabled={!canEdit || isEditing}
 					>
 						{isCollapsed ? (
 							<ChevronRight className="h-4 w-4" />
@@ -135,62 +127,15 @@ export default function ConversationTurnComponent({
 						)}
 						<span>{isCollapsed ? "Open" : "Close"}</span>
 					</button>
-					{isAgent &&
-						typeof referenceCount === "number" &&
-						referenceCount > 0 && (
-							<button
-								type="button"
-								onClick={onViewReferences}
-								className="flex items-center gap-1 rounded-full bg-violet-100 px-2 py-0.5 text-xs font-medium text-violet-700 transition-colors hover:bg-violet-200"
-								title={`View ${referenceCount} reference${referenceCount !== 1 ? "s" : ""} for this turn`}
-							>
-								<Paperclip className="h-3 w-3" />
-								{referenceCount} reference{referenceCount !== 1 ? "s" : ""}
-							</button>
-						)}
-					{isAgent &&
-						typeof referenceCount === "number" &&
-						referenceCount === 0 && (
-							<button
-								type="button"
-								onClick={onViewReferences}
-								className="flex items-center gap-1 rounded-full bg-violet-100 px-2 py-0.5 text-xs font-medium text-violet-700 transition-colors hover:bg-violet-200"
-								title="Add references for this turn"
-							>
-								<Paperclip className="h-3 w-3" />
-								Add reference
-							</button>
-						)}
-					{/** Inline spinner appears while generating */}
-					{isAgent && isGenerating && (
-						<span className="ml-1 flex items-center gap-1 rounded-lg border border-violet-200 px-2 py-1 text-xs font-medium text-violet-700">
-							<Loader2 className="h-3 w-3 animate-spin" />
-							<span>Running…</span>
-						</span>
-					)}
-					{/* Removed inline tag badges to avoid overflow */}
 				</div>
 				<div className="flex items-center gap-1">
-					{controlsEnabled && !isEditing && (
+					{canEdit && !isEditing && (
 						<>
-							{isAgent && onRegenerate && (
-								<button
-									type="button"
-									onClick={onRegenerate}
-									title="Run Agent - Performs full agent workflow with tools and search. Updates both answer and references."
-									className="flex items-center gap-1 rounded-lg border border-violet-200 px-2 py-1 text-xs font-medium text-violet-700 hover:bg-violet-100 disabled:cursor-not-allowed disabled:opacity-50"
-									disabled={isGenerating}
-								>
-									<Bot className="h-4 w-4" />
-									<span>Agent</span>
-								</button>
-							)}
 							<button
 								type="button"
 								onClick={toggleEditMode}
 								title="Edit turn"
 								className="flex items-center gap-1 rounded-lg border border-slate-200 px-2 py-1 text-xs font-medium text-slate-700 hover:bg-white disabled:cursor-not-allowed disabled:opacity-50"
-								disabled={isGenerating}
 							>
 								<Edit2 className="h-4 w-4" />
 								<span>Edit</span>
@@ -200,7 +145,6 @@ export default function ConversationTurnComponent({
 								onClick={onDelete}
 								title="Delete turn"
 								className="flex items-center gap-1 rounded-lg border border-rose-200 px-2 py-1 text-xs font-medium text-rose-700 hover:bg-rose-50 disabled:cursor-not-allowed disabled:opacity-50"
-								disabled={isGenerating}
 							>
 								<Trash2 className="h-4 w-4" />
 								<span>Delete</span>
@@ -232,50 +176,29 @@ export default function ConversationTurnComponent({
 				</div>
 			</div>
 
-			{!isCollapsed && (
-				<>
-					{/* Expected behavior selector for agent turns - moved to top */}
-					{isAgent && onUpdateExpectedBehavior && (
-						<div
-							className={cn(
-								"mb-3 rounded-lg border p-3",
-								!turn.expectedBehavior || turn.expectedBehavior.length === 0
-									? "border-rose-200 bg-rose-50/50"
-									: "border-violet-100 bg-violet-50/50",
-							)}
-						>
-							<ExpectedBehaviorSelector
-								selectedBehaviors={turn.expectedBehavior || []}
-								onChange={onUpdateExpectedBehavior}
-								disabled={!controlsEnabled || isGenerating}
-							/>
-						</div>
-					)}
-
-					{isEditing ? (
-						<textarea
-							className="w-full rounded-lg border border-slate-300 p-3 text-sm focus:outline-none focus:ring-2 focus:ring-violet-300"
-							value={editContent}
-							onChange={(e) => setEditContent(e.target.value)}
-							rows={6}
-							disabled={isGenerating}
-							placeholder={
-								isUser
-									? "Enter user message..."
-									: "Enter agent response (Markdown supported)..."
-							}
-						/>
-					) : (
-						<MarkdownRenderer
-							content={turn.content}
-							compact
-							className={cn(
-								"rounded-lg px-1 py-0.5", // subtle padding within turn box
-							)}
-						/>
-					)}
-				</>
-			)}
+			{!isCollapsed &&
+				(isEditing ? (
+					<textarea
+						aria-label={`${roleLabel} turn ${index + 1} content`}
+						className="w-full rounded-lg border border-slate-300 p-3 text-sm focus:outline-none focus:ring-2 focus:ring-violet-300"
+						value={editContent}
+						onChange={(e) => setEditContent(e.target.value)}
+						rows={6}
+						placeholder={
+							isUser
+								? "Enter user message..."
+								: "Enter agent response (Markdown supported)..."
+						}
+					/>
+				) : (
+					<MarkdownRenderer
+						content={turn.content}
+						compact
+						className={cn(
+							"rounded-lg px-1 py-0.5", // subtle padding within turn box
+						)}
+					/>
+				))}
 		</div>
 	);
 }
diff --git a/frontend/src/components/app/editor/MultiTurnEditor.tsx b/frontend/src/components/app/editor/MultiTurnEditor.tsx
index ad9c916..a6527b0 100644
--- a/frontend/src/components/app/editor/MultiTurnEditor.tsx
+++ b/frontend/src/components/app/editor/MultiTurnEditor.tsx
@@ -4,33 +4,23 @@ import {
 	MessageCircle,
 	UserCircle,
 } from "lucide-react";
-import { useEffect, useMemo, useState } from "react";
-import type { AgentGenerationResult } from "../../../hooks/useGroundTruth";
-import { useReferencesSearch } from "../../../hooks/useReferencesSearch";
+import { useMemo, useState } from "react";
 import useTags from "../../../hooks/useTags";
 import type {
 	ConversationTurn,
-	ExpectedBehavior,
 	GroundTruthItem,
-	Reference,
 } from "../../../models/groundTruth";
 import { validateConversationPattern } from "../../../models/validators";
 import TagChip from "../../common/TagChip";
 import ConversationTurnComponent from "./ConversationTurn";
 import TagsModal from "./TagsModal";
-import TurnReferencesModal from "./TurnReferencesModal";
 
 type Props = {
 	current: GroundTruthItem | null;
 	readOnly?: boolean;
 	onUpdateHistory: (history: ConversationTurn[]) => void;
 	onDeleteTurn: (messageIndex: number) => void;
-	onGenerate: (messageIndex: number) => Promise<AgentGenerationResult>;
 	canEdit: boolean;
-	onUpdateReference: (refId: string, partial: Partial<Reference>) => void;
-	onRemoveReference: (refId: string) => void;
-	onOpenReference: (ref: Reference) => void;
-	onAddReferences?: (refs: Reference[]) => void;
 	onUpdateTags: (tags: string[]) => void;
 };
 
@@ -39,67 +29,26 @@ export default function MultiTurnEditor({
 	readOnly = false,
 	onUpdateHistory,
 	onDeleteTurn,
-	onGenerate,
 	canEdit,
-	onUpdateReference,
-	onRemoveReference,
-	onOpenReference,
-	onAddReferences,
 	onUpdateTags,
 }: Props) {
-	const [isGenerating, setIsGenerating] = useState(false);
-	const [generatingMessageIndex, setGeneratingMessageIndex] = useState<
-		number | null
-	>(null);
-	const [agentError, setAgentError] = useState<string | null>(null);
-	const [viewingReferencesForTurn, setViewingReferencesForTurn] = useState<
-		number | null
-	>(null);
 	const [managingGroundTruthTags, setManagingGroundTruthTags] = useState(false);
 
-	// Search functionality for references modal
-	const { query, setQuery, searching, searchResults, runSearch, clearResults } =
-		useReferencesSearch({
-			getSeedQuery: () => current?.question,
-		});
-
-	useEffect(() => {
-		if (!current) {
-			setAgentError(null);
-			setIsGenerating(false);
-			setGeneratingMessageIndex(null);
-			clearResults();
-			return;
-		}
-		setAgentError(null);
-		setIsGenerating(false);
-		setGeneratingMessageIndex(null);
-	}, [current, clearResults]);
-
 	const history = current?.history || [];
-	const references = current?.references || [];
-
-	// Backend-provided global tags (cached via useTags)
-	const { allTags: availableTags, refresh: refreshTags } = useTags();
+	const shouldLoadTags = canEdit && !readOnly;
 
-	// Calculate reference counts per turn
-	const referenceCounts = useMemo(() => {
-		const counts = new Map<number, number>();
-		references.forEach((ref) => {
-			if (typeof ref.messageIndex === "number") {
-				counts.set(ref.messageIndex, (counts.get(ref.messageIndex) || 0) + 1);
-			}
-		});
-		return counts;
-	}, [references]);
+	// Backend-provided global tags (cached via the shared metadata service)
+	const { allTags: availableTags, refresh: refreshTags } = useTags({
+		enabled: shouldLoadTags,
+	});
 
-	// Determine which turn types can be added next
-	const lastTurn = history.length > 0 ? history[history.length - 1] : null;
-	const canAddUser = !lastTurn || lastTurn.role === "agent";
-	const canAddAgent = lastTurn?.role === "user";
+	// Either turn type can be appended after any turn (agentic workflows allow
+	// consecutive agent turns such as orchestrator → sub-agent or RCA).
+	const canAddUser = true;
+	const canAddAgent = true;
 
 	const handleAddUserTurn = () => {
-		if (!canAddUser || isGenerating) return;
+		if (!canAddUser) return;
 		const newHistory: ConversationTurn[] = [
 			...history,
 			{ role: "user", content: "" },
@@ -108,8 +57,7 @@ export default function MultiTurnEditor({
 	};
 
 	const handleAddAgentTurn = () => {
-		if (!canAddAgent || isGenerating) return;
-		// Create empty agent turn placeholder - user will manually trigger generation
+		if (!canAddAgent) return;
 		const newHistory: ConversationTurn[] = [
 			...history,
 			{ role: "agent", content: "" },
@@ -123,75 +71,15 @@ export default function MultiTurnEditor({
 		onUpdateHistory(newHistory);
 	};
 
-	const handleUpdateExpectedBehavior = (
-		index: number,
-		expectedBehavior: ExpectedBehavior[],
-	) => {
-		const newHistory = [...history];
-		newHistory[index] = { ...newHistory[index], expectedBehavior };
-		onUpdateHistory(newHistory);
-	};
-
 	const handleRemoveTurn = (index: number) => {
-		if (isGenerating) return;
-
 		const turn = history[index];
 		if (!turn) return;
 
-		// Validation: Check if deleting this turn would break conversation flow
-		// If this is a user turn and the next turn is an agent turn, warn the user
-		if (turn.role === "user" && index < history.length - 1) {
-			const nextTurn = history[index + 1];
-			if (nextTurn?.role === "agent") {
-				if (
-					!window.confirm(
-						"Deleting this user turn will also require deleting the following agent turn to maintain conversation flow. Delete both turns?",
-					)
-				) {
-					return;
-				}
-				// Delete both the user turn and the following agent turn
-				onDeleteTurn(index);
-				// After deleting index, the next turn shifts down, so delete at same index
-				onDeleteTurn(index);
-				return;
-			}
-		}
-
-		// Standard confirmation for single turn deletion
 		if (window.confirm("Are you sure you want to delete this turn?")) {
 			onDeleteTurn(index);
 		}
 	};
 
-	const handleRegenerate = async (index: number) => {
-		if (isGenerating) return;
-
-		// Confirmation check
-		if (
-			!window.confirm(
-				"Run Agent will perform a full agent workflow with tools and search. This will replace both the answer and references for this turn. Continue?",
-			)
-		) {
-			return;
-		}
-
-		setAgentError(null);
-		setIsGenerating(true);
-		setGeneratingMessageIndex(index);
-		try {
-			const result = await onGenerate(index);
-			if (!result.ok && result.error) setAgentError(result.error);
-		} catch (err) {
-			const message =
-				err instanceof Error ? err.message : "Agent request failed.";
-			setAgentError(message);
-		} finally {
-			setIsGenerating(false);
-			setGeneratingMessageIndex(null);
-		}
-	};
-
 	// Validate conversation pattern for visual feedback
 	const patternValidation = useMemo(
 		() => validateConversationPattern(history),
@@ -239,82 +127,37 @@ export default function MultiTurnEditor({
 					{history.length === 0 ? (
 						<div className="rounded-xl border border-slate-200 bg-slate-50 p-6 text-center">
 							<p className="text-sm text-slate-600">
-								No conversation turns yet. Start by adding a user turn, then
-								alternate between user and agent turns.
+								No conversation turns yet. Start by adding a user turn.
 							</p>
 						</div>
 					) : (
-						history.map((turn, index) => (
-							<ConversationTurnComponent
-								// Use a composite key instead of the raw array index
-								key={`${turn.role}-${index}`}
-								turn={turn}
-								index={index}
-								isLast={index === history.length - 1}
-								onUpdate={(content) => handleUpdateTurn(index, content)}
-								onUpdateExpectedBehavior={
-									turn.role === "agent"
-										? (behaviors) =>
-												handleUpdateExpectedBehavior(index, behaviors)
-										: undefined
-								}
-								onDelete={() => handleRemoveTurn(index)}
-								onRegenerate={
-									turn.role === "agent"
-										? () => {
-												void handleRegenerate(index);
-											}
-										: undefined
-								}
-								isGenerating={isGenerating && generatingMessageIndex === index}
-								canEdit={canEdit && !readOnly}
-								referenceCount={
-									turn.role === "agent" ? referenceCounts.get(index) : undefined
-								}
-								onViewReferences={
-									turn.role === "agent"
-										? () => setViewingReferencesForTurn(index)
-										: undefined
-								}
-							/>
-						))
-					)}
-					{isGenerating && generatingMessageIndex === history.length && (
-						<ConversationTurnComponent
-							key={`pending-agent-${history.length}`}
-							turn={{ role: "agent", content: "" }}
-							index={history.length}
-							isLast
-							onUpdate={() => {}}
-							onDelete={() => {}}
-							canEdit={false}
-							isGenerating
-						/>
+						history.map((turn, idx) => {
+							const turnKey =
+								turn.turnId || turn.stepId || `${turn.role}-${String(idx)}`;
+							return (
+								<ConversationTurnComponent
+									key={turnKey}
+									turn={turn}
+									index={idx}
+									isLast={idx === history.length - 1}
+									onUpdate={(content) => handleUpdateTurn(idx, content)}
+									onDelete={() => handleRemoveTurn(idx)}
+									canEdit={canEdit && !readOnly}
+								/>
+							);
+						})
 					)}
 				</div>
 			</div>
 
-			{agentError && (
-				<div
-					className="mt-3 rounded-lg border border-rose-200 bg-rose-50 p-3 text-sm text-rose-800"
-					role="alert"
-				>
-					{agentError}
-				</div>
-			)}
-
 			{/* Add turn buttons */}
 			{!readOnly && canEdit && (
 				<div className="mt-4 flex gap-2 border-t border-slate-200 pt-4">
 					<button
 						type="button"
 						onClick={handleAddUserTurn}
-						disabled={!canAddUser || isGenerating}
-						title={
-							!canAddUser
-								? "Can only add user turn after agent turn or as first turn"
-								: "Add a new user turn"
-						}
+						disabled={!canAddUser}
+						title="Add a new user turn"
 						className="flex flex-1 items-center justify-center gap-2 rounded-lg border border-blue-200 bg-blue-50 px-4 py-2 text-sm font-medium text-blue-700 hover:bg-blue-100 disabled:cursor-not-allowed disabled:opacity-50 disabled:hover:bg-blue-50"
 					>
 						<UserCircle className="h-4 w-4" />
@@ -323,12 +166,8 @@ export default function MultiTurnEditor({
 					<button
 						type="button"
 						onClick={handleAddAgentTurn}
-						disabled={!canAddAgent || isGenerating}
-						title={
-							!canAddAgent
-								? "Can only add agent turn after user turn"
-								: "Add a new agent turn"
-						}
+						disabled={!canAddAgent}
+						title="Add a new agent turn"
 						className="flex flex-1 items-center justify-center gap-2 rounded-lg border border-violet-200 bg-violet-50 px-4 py-2 text-sm font-medium text-violet-700 hover:bg-violet-100 disabled:cursor-not-allowed disabled:opacity-50 disabled:hover:bg-violet-50"
 					>
 						<MessageCircle className="h-4 w-4" />
@@ -369,30 +208,6 @@ export default function MultiTurnEditor({
 				</div>
 			)}
 
-			{/* Turn References Modal */}
-			{viewingReferencesForTurn !== null && (
-				<TurnReferencesModal
-					isOpen={true}
-					onClose={() => setViewingReferencesForTurn(null)}
-					messageIndex={viewingReferencesForTurn}
-					references={references}
-					onUpdateReference={onUpdateReference}
-					onRemoveReference={onRemoveReference}
-					onOpenReference={onOpenReference}
-					readOnly={readOnly}
-					query={query}
-					setQuery={setQuery}
-					searching={searching}
-					searchResults={searchResults}
-					onRunSearch={runSearch}
-					onAddSearchResult={(ref) => {
-						if (onAddReferences) {
-							onAddReferences([ref]);
-						}
-					}}
-				/>
-			)}
-
 			{/* Ground Truth Tags Modal */}
 			{managingGroundTruthTags && current && (
 				<TagsModal
diff --git a/frontend/src/components/app/editor/TurnReferencesModal.tsx b/frontend/src/components/app/editor/TurnReferencesModal.tsx
index 82f619c..8d77bb6 100644
--- a/frontend/src/components/app/editor/TurnReferencesModal.tsx
+++ b/frontend/src/components/app/editor/TurnReferencesModal.tsx
@@ -10,13 +10,15 @@ import useModalKeys from "../../../hooks/useModalKeys";
 import { useToasts } from "../../../hooks/useToasts";
 import type { Reference } from "../../../models/groundTruth";
 import { cn, normalizeUrl, urlToTitle } from "../../../models/utils";
-import { getCachedConfig } from "../../../services/runtimeConfig";
+import { getReferenceApprovalRequirements } from "../../../models/validators";
+import { useRuntimeConfig } from "../../../services/runtimeConfig";
 import ModalPortal from "../../modals/ModalPortal";
 
 type Props = {
 	isOpen: boolean;
 	onClose: () => void;
 	messageIndex: number;
+	turnId?: string;
 	references: Reference[];
 	onUpdateReference: (refId: string, partial: Partial<Reference>) => void;
 	onRemoveReference: (refId: string) => void;
@@ -35,6 +37,7 @@ export default function TurnReferencesModal({
 	isOpen,
 	onClose,
 	messageIndex,
+	turnId,
 	references,
 	onUpdateReference,
 	onRemoveReference,
@@ -53,9 +56,11 @@ export default function TurnReferencesModal({
 		new Set(),
 	);
 
-	const config = getCachedConfig();
-	const requireVisit = config?.requireReferenceVisit ?? true;
-	const requireKeyPara = config?.requireKeyParagraph ?? false;
+	const runtimeConfig = useRuntimeConfig();
+	const {
+		requireReferenceVisit: requireVisit,
+		requireKeyParagraph: requireKeyPara,
+	} = getReferenceApprovalRequirements(runtimeConfig);
 	const { toasts, showToast, dismiss } = useToasts();
 	const undoTimerRef = useRef<number | null>(null);
 
@@ -93,8 +98,11 @@ export default function TurnReferencesModal({
 
 	// Filter references for this specific turn only
 	const turnRefs = useMemo(
-		() => references.filter((r) => r.messageIndex === messageIndex),
-		[references, messageIndex],
+		() =>
+			references.filter((r) =>
+				turnId ? r.turnId === turnId : r.messageIndex === messageIndex,
+			),
+		[messageIndex, references, turnId],
 	);
 
 	// References already added to this turn (by URL for duplicate prevention)
@@ -121,8 +129,7 @@ export default function TurnReferencesModal({
 	};
 
 	const handleAddSearchResult = (ref: Reference, silent = false) => {
-		// Assign messageIndex automatically
-		onAddSearchResult({ ...ref, messageIndex });
+		onAddSearchResult({ ...ref, messageIndex, turnId });
 		setSelectedSearchIds((prev) => {
 			const next = new Set(prev);
 			next.delete(ref.id);
@@ -141,7 +148,7 @@ export default function TurnReferencesModal({
 			selectedSearchIds.has(r.id),
 		);
 		chosen.forEach((r) => {
-			onAddSearchResult({ ...r, messageIndex });
+			onAddSearchResult({ ...r, messageIndex, turnId });
 		});
 		setSelectedSearchIds(new Set());
 		showToast(
diff --git a/frontend/src/components/app/editors/ContextEntryEditor.tsx b/frontend/src/components/app/editors/ContextEntryEditor.tsx
new file mode 100644
index 0000000..09f71e3
--- /dev/null
+++ b/frontend/src/components/app/editors/ContextEntryEditor.tsx
@@ -0,0 +1,167 @@
+/**
+ * ContextEntryEditor — editable list of context entries (key-value pairs).
+ *
+ * Supports inline editing, adding new entries, and removing entries.
+ * Changes propagate through the onUpdate callback, which receives the
+ * full updated entries array.
+ *
+ * Phase 3 Step 3.3.
+ */
+
+import { Plus, Trash2 } from "lucide-react";
+import { useCallback, useEffect, useRef, useState } from "react";
+import type { ContextEntry } from "../../../models/groundTruth";
+import { cn } from "../../../models/utils";
+
+function ContextEntryRow({
+	entry,
+	onUpdateKey,
+	onUpdateValue,
+	onRemove,
+}: {
+	entry: ContextEntry;
+	onUpdateKey: (key: string) => void;
+	onUpdateValue: (value: unknown) => void;
+	onRemove: () => void;
+}) {
+	const [editing, setEditing] = useState(false);
+	const textareaRef = useRef<HTMLTextAreaElement>(null);
+	const [localValue, setLocalValue] = useState(() =>
+		typeof entry.value === "string"
+			? entry.value
+			: JSON.stringify(entry.value, null, 2),
+	);
+
+	// Focus textarea when entering edit mode
+	useEffect(() => {
+		if (editing) textareaRef.current?.focus();
+	}, [editing]);
+
+	const commitValue = useCallback(() => {
+		setEditing(false);
+		// Try to parse as JSON; fall back to string
+		let parsed: unknown;
+		try {
+			parsed = JSON.parse(localValue);
+		} catch {
+			parsed = localValue;
+		}
+		onUpdateValue(parsed);
+	}, [localValue, onUpdateValue]);
+
+	return (
+		<div className="group rounded-lg border border-slate-200 bg-slate-50 p-3 space-y-1">
+			<div className="flex items-start gap-2">
+				<input
+					type="text"
+					aria-label="Context entry key"
+					className="font-mono text-xs text-slate-600 bg-transparent border-b border-dashed border-slate-300 focus:border-violet-400 focus:outline-none px-0.5 py-0 flex-shrink-0 w-32"
+					value={entry.key}
+					onChange={(e) => onUpdateKey(e.target.value)}
+				/>
+				<span className="text-xs text-slate-400 flex-shrink-0">:</span>
+
+				{editing ? (
+					<textarea
+						ref={textareaRef}
+						aria-label="Context entry value"
+						className="flex-1 min-h-[3rem] resize-y rounded-md border border-slate-300 bg-white p-2 text-xs text-slate-700 font-mono focus:outline-none focus:ring-1 focus:ring-violet-300"
+						value={localValue}
+						onChange={(e) => setLocalValue(e.target.value)}
+						onBlur={commitValue}
+						onKeyDown={(e) => {
+							if (e.key === "Enter" && !e.shiftKey) {
+								e.preventDefault();
+								commitValue();
+							}
+						}}
+					/>
+				) : (
+					<button
+						type="button"
+						className={cn(
+							"flex-1 text-left text-xs text-slate-700 break-all rounded px-1 py-0.5",
+							"hover:bg-violet-50 hover:text-violet-800 cursor-text transition-colors",
+						)}
+						onClick={() => {
+							setLocalValue(
+								typeof entry.value === "string"
+									? entry.value
+									: JSON.stringify(entry.value, null, 2),
+							);
+							setEditing(true);
+						}}
+						title="Click to edit"
+					>
+						{typeof entry.value === "object"
+							? JSON.stringify(entry.value)
+							: String(entry.value ?? "")}
+					</button>
+				)}
+
+				<button
+					type="button"
+					onClick={onRemove}
+					className="flex-shrink-0 rounded p-1 text-slate-400 opacity-0 group-hover:opacity-100 hover:bg-rose-50 hover:text-rose-600 transition-all"
+					aria-label={`Remove context entry ${entry.key}`}
+					title="Remove entry"
+				>
+					<Trash2 className="h-3.5 w-3.5" />
+				</button>
+			</div>
+		</div>
+	);
+}
+
+export default function ContextEntryEditor({
+	entries,
+	onUpdate,
+}: {
+	entries: ContextEntry[];
+	onUpdate: (entries: ContextEntry[]) => void;
+}) {
+	const updateEntry = useCallback(
+		(index: number, patch: Partial<ContextEntry>) => {
+			const next = entries.map((e, i) =>
+				i === index ? { ...e, ...patch } : e,
+			);
+			onUpdate(next);
+		},
+		[entries, onUpdate],
+	);
+
+	const removeEntry = useCallback(
+		(index: number) => {
+			onUpdate(entries.filter((_, i) => i !== index));
+		},
+		[entries, onUpdate],
+	);
+
+	const addEntry = useCallback(() => {
+		onUpdate([...entries, { key: "", value: "" }]);
+	}, [entries, onUpdate]);
+
+	return (
+		<div className="space-y-2">
+			{entries.map((entry, i) => (
+				<ContextEntryRow
+					// biome-ignore lint/suspicious/noArrayIndexKey: entries have no stable id
+					key={i}
+					entry={entry}
+					onUpdateKey={(key) => updateEntry(i, { key })}
+					onUpdateValue={(value) => updateEntry(i, { value })}
+					onRemove={() => removeEntry(i)}
+				/>
+			))}
+
+			<button
+				type="button"
+				onClick={addEntry}
+				className="flex items-center gap-1.5 rounded-lg border border-dashed border-slate-300 px-3 py-2 text-xs text-slate-500 hover:border-violet-400 hover:text-violet-600 hover:bg-violet-50 transition-colors w-full justify-center"
+			>
+				<Plus className="h-3.5 w-3.5" />
+				Add context entry
+			</button>
+		</div>
+	);
+}
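Review note on `commitValue` in ContextEntryEditor.tsx: the JSON-or-string fallback means that any text which happens to be valid JSON changes type on commit. The sketch below isolates that logic in two hypothetical helpers (`parseEntryValue` and `formatEntryValue` are illustration names, not part of the diff) so the round trip is easy to follow. It assumes the component behaves exactly as the hunk above reads.

```typescript
// Hypothetical helper mirroring the try/catch in commitValue:
// valid JSON is stored as the parsed value, anything else as the raw string.
function parseEntryValue(raw: string): unknown {
	try {
		return JSON.parse(raw);
	} catch {
		return raw;
	}
}

// Hypothetical helper mirroring how localValue is seeded when editing starts:
// strings are shown verbatim, everything else as pretty-printed JSON.
function formatEntryValue(value: unknown): string {
	return typeof value === "string" ? value : JSON.stringify(value, null, 2);
}

const samples = ["hello world", "42", "true", '{"region":"eu"}', '"42"'];
for (const raw of samples) {
	const parsed = parseEntryValue(raw);
	console.log(JSON.stringify(raw), "->", typeof parsed);
}

// Round trip: a stored string "42" is shown as 42 in the textarea,
// so committing without any change stores the number 42.
const roundTripped = parseEntryValue(formatEntryValue("42"));
console.log(typeof roundTripped);
```

If string values such as `"42"` need to survive an open-and-blur cycle, one option would be to skip parsing when the stored value was a string and the text is unchanged. Whether that matters depends on how context entries are consumed downstream.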
diff --git a/frontend/src/components/app/editors/ToolCallDetailView.tsx b/frontend/src/components/app/editors/ToolCallDetailView.tsx
new file mode 100644
index 0000000..e896957
--- /dev/null
+++ b/frontend/src/components/app/editors/ToolCallDetailView.tsx
@@ -0,0 +1,249 @@
+/**
+ * ToolCallDetailView -- compact grid-based tool call card aligned with wireframe v2.2.
+ *
+ * Header: 6-column CSS grid
+ *   Col 1: Order badge (#1)
+ *   Col 2: Function name in dark bar
+ *   Col 3: Parallel group badge (if any)
+ *   Col 4: Execution time
+ *   Col 5: Decision badge (Required / Optional / Not needed)
+ *   Col 6: Expand/collapse toggle
+ *
+ * Expanded:
+ *   - "Was this call needed?" segmented toggle (when onUpdateExpectedTools provided)
+ *   - Arguments + Result in dark code blocks
+ */
+
+import { useMemo, useState } from "react";
+import type {
+	ExpectedTools,
+	GroundTruthItem,
+	Reference,
+	ToolCallRecord,
+} from "../../../models/groundTruth";
+import { ToolCallExtensionRenderer } from "../../../registry";
+import { getExecTime } from "../TracePanel";
+import {
+	getToolState,
+	type NecessityState,
+	setToolNecessity,
+} from "./toolNecessity";
+
+// ---------------------------------------------------------------------------
+// Decision segments -- mirrors wireframe DECISION_SEGMENTS
+// ---------------------------------------------------------------------------
+
+const DECISION_SEGMENTS: {
+	value: NecessityState;
+	symbol: string;
+	label: string;
+	title: string;
+	activeClasses: string;
+	dotClasses: string;
+}[] = [
+	{
+		value: "required",
+		symbol: "\u2605",
+		label: "\u2605 Required",
+		title: "Needed to reach the correct answer",
+		activeClasses: "bg-emerald-600 text-white shadow-sm",
+		dotClasses: "bg-emerald-600 text-white",
+	},
+	{
+		value: "optional",
+		symbol: "\u25CB",
+		label: "\u25CB Optional",
+		title: "Fine to call but not essential",
+		activeClasses: "bg-sky-600 text-white shadow-sm",
+		dotClasses: "bg-sky-600 text-white",
+	},
+	{
+		value: "not-needed",
+		symbol: "\u2715",
+		label: "\u2715 Not needed",
+		title: "Should not have been called",
+		activeClasses: "bg-rose-600 text-white shadow-sm",
+		dotClasses: "bg-rose-600 text-white",
+	},
+];
+
+// ---------------------------------------------------------------------------
+// Component
+// ---------------------------------------------------------------------------
+
+export default function ToolCallDetailView({
+	tc,
+	index,
+	item,
+	expectedTools,
+	onUpdateExpectedTools,
+	references,
+	onAddReferences,
+	onOpenReference,
+	onUpdateReference,
+	onRemoveReference,
+}: {
+	tc: ToolCallRecord;
+	index: number;
+	item: GroundTruthItem;
+	expectedTools?: ExpectedTools;
+	onUpdateExpectedTools?: (tools: ExpectedTools) => void;
+	references?: Reference[];
+	onAddReferences?: (refs: Reference[]) => void;
+	onOpenReference?: (ref: Reference) => void;
+	onUpdateReference?: (refId: string, partial: Partial<Reference>) => void;
+	onRemoveReference?: (refId: string) => void;
+}) {
+	const [expanded, setExpanded] = useState(false);
+
+	const hasArgs = tc.arguments != null && Object.keys(tc.arguments).length > 0;
+	const hasResponse = tc.response !== undefined && tc.response !== null;
+	const execTime = getExecTime(tc.response);
+
+	const decision = getToolState(tc.name, expectedTools);
+	const seg =
+		DECISION_SEGMENTS.find((s) => s.value === decision) ?? DECISION_SEGMENTS[1];
+
+	// Memoize the extension context; it is rebuilt only when item or onUpdateExpectedTools changes
+	const extensionContext = useMemo(
+		() => ({
+			item,
+			readOnly: !onUpdateExpectedTools,
+		}),
+		[item, onUpdateExpectedTools],
+	);
+
+	return (
+		<div className="mt-2 rounded-xl border border-slate-200 bg-white">
+			{/* Header */}
+			<button
+				type="button"
+				className="grid w-full items-center p-3 cursor-pointer hover:bg-slate-50/50 rounded-xl gap-x-2 text-left"
+				style={{
+					gridTemplateColumns: "2rem minmax(0,1fr) 3.5rem 4rem 1.25rem 0.75rem",
+				}}
+				onClick={() => setExpanded((v) => !v)}
+				aria-expanded={expanded}
+				aria-label={`Toggle tool call ${tc.name}`}
+			>
+				{/* Col 1: Order */}
+				<span className="rounded-full bg-violet-100 px-2 py-0.5 text-xs font-semibold text-violet-800 text-center">
+					#{tc.stepNumber ?? index + 1}
+				</span>
+
+				{/* Col 2: Function name */}
+				<span
+					className="rounded-lg bg-slate-700 px-2 py-0.5 text-xs font-mono text-white truncate"
+					title={tc.name}
+				>
+					{tc.name}
+				</span>
+
+				{/* Col 3: Parallel group */}
+				{tc.parallelGroup ? (
+					<span className="text-xs font-medium text-amber-600 text-center whitespace-nowrap">
+						{"\u2016"} {tc.parallelGroup}
+					</span>
+				) : (
+					<span />
+				)}
+
+				{/* Col 4: Execution time */}
+				<span className="text-xs text-slate-400 text-right whitespace-nowrap tabular-nums">
+					{execTime !== null ? `${execTime.toFixed(2)}s` : "\u2014"}
+				</span>
+
+				{/* Col 5: Decision badge */}
+				<span
+					className={`inline-flex items-center justify-center rounded-full w-5 h-5 text-xs font-bold ${seg.dotClasses}`}
+					title={seg.title}
+				>
+					{seg.symbol}
+				</span>
+
+				{/* Col 6: Expand/collapse */}
+				<span className="text-xs text-slate-400 text-center">
+					{expanded ? "\u25BE" : "\u25B8"}
+				</span>
+			</button>
+
+			{/* Expanded detail */}
+			{expanded && (
+				<>
+					{/* Decision toggle */}
+					{onUpdateExpectedTools && (
+						<div className="border-t border-slate-100 px-3 py-3 text-xs text-slate-700">
+							<div className="mb-2 font-semibold uppercase tracking-wide text-slate-500">
+								Was this call needed for the correct answer?
+							</div>
+							<div
+								className="inline-flex rounded-lg border border-slate-200 bg-slate-100 p-0.5"
+								role="group"
+								aria-label="Tool call relevance"
+							>
+								{DECISION_SEGMENTS.map((s) => (
+									<button
+										key={s.value}
+										type="button"
+										className={`relative select-none rounded-md px-3 py-1.5 text-xs font-semibold transition-all duration-150 ${
+											decision === s.value
+												? s.activeClasses
+												: "text-slate-500 hover:text-slate-700"
+										}`}
+										aria-pressed={decision === s.value}
+										title={s.title}
+										onClick={() =>
+											onUpdateExpectedTools(
+												setToolNecessity(tc.name, s.value, expectedTools),
+											)
+										}
+									>
+										{s.label}
+									</button>
+								))}
+							</div>
+						</div>
+					)}
+
+					{/* Arguments + Result */}
+					<div className="border-t p-3 space-y-2">
+						{hasArgs && (
+							<>
+								<div className="text-xs font-semibold text-slate-600 uppercase tracking-wide">
+									Arguments
+								</div>
+								<pre className="rounded-lg bg-slate-800 p-3 text-xs text-green-400 overflow-x-auto whitespace-pre-wrap">
+									{JSON.stringify(tc.arguments, null, 2)}
+								</pre>
+							</>
+						)}
+
+						{hasResponse && (
+							<>
+								<div className="text-xs font-semibold text-slate-600 uppercase tracking-wide mt-2">
+									Result
+								</div>
+								<pre className="rounded-lg bg-slate-800 p-3 text-xs text-green-400 overflow-x-auto max-h-60 overflow-y-auto whitespace-pre-wrap">
+									{typeof tc.response === "object"
+										? JSON.stringify(tc.response, null, 2)
+										: String(tc.response)}
+								</pre>
+							</>
+						)}
+					</div>
+
+					{/* Plugin-contributed tool call actions */}
+					<ToolCallExtensionRenderer
+						toolCall={tc}
+						context={extensionContext}
+						references={references ?? []}
+						onAddReferences={onAddReferences}
+						onOpenReference={onOpenReference}
+						onUpdateReference={onUpdateReference}
+						onRemoveReference={onRemoveReference}
+					/>
+				</>
+			)}
+		</div>
+	);
+}
diff --git a/frontend/src/components/app/editors/ToolCallReferencesAction.tsx b/frontend/src/components/app/editors/ToolCallReferencesAction.tsx
new file mode 100644
index 0000000..e0ad1d9
--- /dev/null
+++ b/frontend/src/components/app/editors/ToolCallReferencesAction.tsx
@@ -0,0 +1,102 @@
+/**
+ * ToolCallReferencesAction — rag-compat plugin action for retrieval tool calls.
+ *
+ * Renders a reference count badge / "Add reference" button inside the
+ * expanded ToolCallDetailView. Clicking toggles an inline references list.
+ */
+
+import { ExternalLink, Paperclip, Trash2 } from "lucide-react";
+import { useMemo, useState } from "react";
+import { normalizeUrl, urlToTitle } from "../../../models/utils";
+import type { ToolCallActionProps } from "../../../registry/types";
+
+export default function ToolCallReferencesAction({
+	toolCall,
+	context,
+	references,
+	onOpenReference,
+	onRemoveReference,
+}: ToolCallActionProps) {
+	const [expanded, setExpanded] = useState(false);
+
+	// References scoped to this tool call
+	const toolCallRefs = useMemo(
+		() => references.filter((r) => r.toolCallId === toolCall.id),
+		[references, toolCall.id],
+	);
+
+	const count = toolCallRefs.length;
+	const readOnly = context.readOnly;
+
+	return (
+		<div className="border-t border-slate-100">
+			{/* Badge / toggle button */}
+			<button
+				type="button"
+				onClick={() => setExpanded((v) => !v)}
+				className="flex w-full items-center gap-2 px-3 py-2 text-xs hover:bg-slate-50"
+			>
+				<Paperclip className="h-3 w-3 text-violet-600" />
+				{count > 0 ? (
+					<span className="font-medium text-violet-700">
+						{count} reference{count !== 1 ? "s" : ""}
+					</span>
+				) : (
+					<span className="font-medium text-violet-700">Add reference</span>
+				)}
+				<span className="ml-auto text-slate-400">{expanded ? "▾" : "▸"}</span>
+			</button>
+
+			{/* Inline references list */}
+			{expanded && (
+				<div className="border-t border-slate-100 px-3 py-2 space-y-2">
+					{toolCallRefs.length === 0 && (
+						<p className="text-xs text-slate-500">
+							No references for this tool call yet.
+						</p>
+					)}
+					{toolCallRefs.map((ref, i) => (
+						<div
+							key={ref.id}
+							className="flex items-start justify-between gap-2 rounded-lg border border-slate-200 p-2 text-xs"
+						>
+							<div className="min-w-0 flex-1">
+								<div className="truncate font-medium text-slate-700">
+									[{i + 1}] {ref.title || urlToTitle(ref.url)}
+								</div>
+								<a
+									className="inline-flex max-w-full items-center gap-1 truncate text-[11px] text-violet-700 underline"
+									onClick={(e) => {
+										e.preventDefault();
+										onOpenReference?.(ref);
+									}}
+									href={normalizeUrl(ref.url)}
+									target="_blank"
+									rel="noopener noreferrer"
+								>
+									<ExternalLink className="h-3 w-3" />
+									{normalizeUrl(ref.url)}
+								</a>
+								{ref.bonus && (
+									<span className="ml-2 rounded-full bg-violet-100 px-1.5 py-0.5 text-[10px] text-violet-700">
+										Bonus
+									</span>
+								)}
+							</div>
+							{!readOnly && onRemoveReference && (
+								<button
+									type="button"
+									title="Remove reference"
+									className="flex-none rounded border border-rose-200 p-1 text-rose-600 hover:bg-rose-50"
+									onClick={() => onRemoveReference(ref.id)}
+								>
+									<Trash2 className="h-3 w-3" />
+								</button>
+							)}
+						</div>
+					))}
+				</div>
+			)}
+		</div>
+	);
+}
diff --git a/frontend/src/components/app/editors/ToolNecessityEditor.tsx b/frontend/src/components/app/editors/ToolNecessityEditor.tsx
new file mode 100644
index 0000000..4552a50
--- /dev/null
+++ b/frontend/src/components/app/editors/ToolNecessityEditor.tsx
@@ -0,0 +1,126 @@
+/**
+ * ToolNecessityEditor — tri-state toggle per tool name for classifying
+ * tool calls as required / optional / not-needed.
+ *
+ * Derives the set of tool names from the union of toolCalls and
+ * existing expectedTools entries. Each toggle updates the parent
+ * ExpectedTools state via onUpdate.
+ *
+ * Phase 4 Step 4.2.
+ */
+
+import { useCallback, useMemo } from "react";
+import type {
+	ExpectedTools,
+	ToolCallRecord,
+} from "../../../models/groundTruth";
+import { cn } from "../../../models/utils";
+import {
+	getToolState,
+	type NecessityState,
+	setToolNecessity,
+} from "./toolNecessity";
+
+type ToolRow = {
+	name: string;
+	state: NecessityState;
+};
+
+const STATES: {
+	value: NecessityState;
+	label: string;
+	color: string;
+	activeColor: string;
+}[] = [
+	{
+		value: "required",
+		label: "Required",
+		color: "text-slate-500",
+		activeColor: "bg-emerald-100 text-emerald-800 ring-1 ring-emerald-300",
+	},
+	{
+		value: "optional",
+		label: "Optional",
+		color: "text-slate-500",
+		activeColor: "bg-amber-100 text-amber-800 ring-1 ring-amber-300",
+	},
+	{
+		value: "not-needed",
+		label: "Not needed",
+		color: "text-slate-500",
+		activeColor: "bg-rose-100 text-rose-800 ring-1 ring-rose-300",
+	},
+];
+
+export default function ToolNecessityEditor({
+	toolCalls,
+	expectedTools,
+	onUpdate,
+}: {
+	toolCalls: ToolCallRecord[];
+	expectedTools: ExpectedTools | undefined;
+	onUpdate: (next: ExpectedTools) => void;
+}) {
+	// Collect all unique tool names from toolCalls + expectedTools
+	const rows: ToolRow[] = useMemo(() => {
+		const names = new Set<string>();
+		for (const tc of toolCalls) names.add(tc.name);
+		for (const t of expectedTools?.required ?? []) names.add(t.name);
+		for (const t of expectedTools?.optional ?? []) names.add(t.name);
+		for (const t of expectedTools?.notNeeded ?? []) names.add(t.name);
+		return [...names].sort().map((name) => ({
+			name,
+			state: getToolState(name, expectedTools),
+		}));
+	}, [toolCalls, expectedTools]);
+
+	const handleToggle = useCallback(
+		(toolName: string, target: NecessityState) => {
+			onUpdate(setToolNecessity(toolName, target, expectedTools));
+		},
+		[expectedTools, onUpdate],
+	);
+
+	if (rows.length === 0) {
+		return (
+			<div className="text-xs italic text-slate-400">
+				No tool calls to classify.
+			</div>
+		);
+	}
+
+	return (
+		<div className="space-y-2">
+			{rows.map((row) => (
+				<div
+					key={row.name}
+					className="flex items-center gap-3 rounded-lg border border-slate-200 bg-slate-50 px-3 py-2"
+				>
+					<span className="text-sm font-mono text-slate-800 flex-1 truncate">
+						{row.name}
+					</span>
+					<div className="flex gap-1">
+						{STATES.map((s) => (
+							<button
+								key={s.value}
+								type="button"
+								className={cn(
+									"rounded-md px-2 py-1 text-xs font-medium transition-colors",
+									row.state === s.value
+										? s.activeColor
+										: "bg-slate-100 hover:bg-slate-200",
+									row.state !== s.value && s.color,
+								)}
+								onClick={() => handleToggle(row.name, s.value)}
+								aria-pressed={row.state === s.value}
+								aria-label={`Set ${row.name} to ${s.label}`}
+							>
+								{s.label}
+							</button>
+						))}
+					</div>
+				</div>
+			))}
+		</div>
+	);
+}
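Review note on the `rows` memo in ToolNecessityEditor.tsx: the editor shows one row per tool name, not per call, and the names come from both the recorded calls and any existing expectations. The sketch below pulls that step into a hypothetical standalone function (`collectToolNames` is an illustration name, and the type shapes are assumed minimal versions of the ones in `models/groundTruth`).

```typescript
// Assumed minimal shapes; the real types live in models/groundTruth.
type ToolExpectation = { name: string };
type ExpectedTools = {
	required?: ToolExpectation[];
	optional?: ToolExpectation[];
	notNeeded?: ToolExpectation[];
};

// Hypothetical standalone version of the name-collection step in the rows memo.
function collectToolNames(
	toolCalls: { name: string }[],
	expectedTools: ExpectedTools | undefined,
): string[] {
	const names = new Set<string>();
	for (const tc of toolCalls) names.add(tc.name);
	for (const t of expectedTools?.required ?? []) names.add(t.name);
	for (const t of expectedTools?.optional ?? []) names.add(t.name);
	for (const t of expectedTools?.notNeeded ?? []) names.add(t.name);
	// The default sort compares UTF-16 code units, so uppercase names come first.
	return Array.from(names).sort();
}

const collected = collectToolNames(
	[{ name: "search" }, { name: "search" }, { name: "Lookup" }],
	{ required: [{ name: "calculator" }] },
);
console.log(collected);
```

Two things follow from this shape: repeated calls to the same tool collapse into one row, and a tool that appears only in `expectedTools` (never called) still gets a row. If case-insensitive ordering is preferred, `sort((a, b) => a.localeCompare(b))` would be one alternative.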
diff --git a/frontend/src/components/app/editors/toolNecessity.ts b/frontend/src/components/app/editors/toolNecessity.ts
new file mode 100644
index 0000000..7224032
--- /dev/null
+++ b/frontend/src/components/app/editors/toolNecessity.ts
@@ -0,0 +1,75 @@
+import type {
+	ExpectedTools,
+	ToolExpectation,
+} from "../../../models/groundTruth";
+
+export type NecessityState = "required" | "optional" | "not-needed";
+
+const EXPECTED_TOOL_BUCKETS = ["required", "optional", "notNeeded"] as const;
+
+function findToolExpectation(
+	toolName: string,
+	current: ExpectedTools | undefined,
+): ToolExpectation | undefined {
+	for (const bucket of EXPECTED_TOOL_BUCKETS) {
+		const match = current?.[bucket]?.find((tool) => tool.name === toolName);
+		if (match) {
+			return match;
+		}
+	}
+
+	return undefined;
+}
+
+function removeFromBucket(
+	entries: ToolExpectation[] | undefined,
+	toolName: string,
+): ToolExpectation[] {
+	return (entries ?? []).filter((tool) => tool.name !== toolName);
+}
+
+export function getToolState(
+	name: string,
+	expectedTools: ExpectedTools | undefined,
+): NecessityState {
+	if (expectedTools?.required?.some((tool) => tool.name === name)) {
+		return "required";
+	}
+	if (expectedTools?.optional?.some((tool) => tool.name === name)) {
+		return "optional";
+	}
+	if (expectedTools?.notNeeded?.some((tool) => tool.name === name)) {
+		return "not-needed";
+	}
+
+	return "optional";
+}
+
+export function setToolNecessity(
+	toolName: string,
+	target: NecessityState,
+	current: ExpectedTools | undefined,
+): ExpectedTools {
+	const entry = findToolExpectation(toolName, current) ?? { name: toolName };
+	const required = removeFromBucket(current?.required, toolName);
+	const optional = removeFromBucket(current?.optional, toolName);
+	const notNeeded = removeFromBucket(current?.notNeeded, toolName);
+
+	switch (target) {
+		case "required":
+			required.push(entry);
+			break;
+		case "optional":
+			optional.push(entry);
+			break;
+		case "not-needed":
+			notNeeded.push(entry);
+			break;
+	}
+
+	return {
+		required,
+		optional,
+		notNeeded,
+	};
+}
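Review note on toolNecessity.ts: the helpers keep each tool name in at most one bucket and return fresh arrays instead of mutating the input. The sketch below is a condensed re-statement of the two exported functions for illustration only, with assumed minimal type shapes, followed by a small walk-through.

```typescript
// Assumed minimal shapes; the real types live in models/groundTruth.
type ToolExpectation = { name: string };
type ExpectedTools = {
	required?: ToolExpectation[];
	optional?: ToolExpectation[];
	notNeeded?: ToolExpectation[];
};
type NecessityState = "required" | "optional" | "not-needed";

const BUCKETS = ["required", "optional", "notNeeded"] as const;

function getToolState(name: string, expected: ExpectedTools | undefined): NecessityState {
	if (expected?.required?.some((t) => t.name === name)) return "required";
	if (expected?.optional?.some((t) => t.name === name)) return "optional";
	if (expected?.notNeeded?.some((t) => t.name === name)) return "not-needed";
	return "optional"; // untracked tools are treated as optional
}

function setToolNecessity(
	toolName: string,
	target: NecessityState,
	current: ExpectedTools | undefined,
): ExpectedTools {
	// Reuse the existing entry (first match in bucket order) if there is one.
	let entry: ToolExpectation = { name: toolName };
	for (const bucket of BUCKETS) {
		const match = current?.[bucket]?.find((t) => t.name === toolName);
		if (match) {
			entry = match;
			break;
		}
	}
	// filter() copies each bucket, so the caller's object is left untouched.
	const next = {
		required: (current?.required ?? []).filter((t) => t.name !== toolName),
		optional: (current?.optional ?? []).filter((t) => t.name !== toolName),
		notNeeded: (current?.notNeeded ?? []).filter((t) => t.name !== toolName),
	};
	const key = target === "not-needed" ? "notNeeded" : target;
	next[key].push(entry);
	return next;
}

const original: ExpectedTools = { optional: [{ name: "search" }] };
const promoted = setToolNecessity("search", "required", original);
const rejected = setToolNecessity("search", "not-needed", promoted);
console.log(getToolState("search", promoted), getToolState("search", rejected));
console.log(JSON.stringify(original));
```

One consequence worth keeping in mind: because untracked tools default to `"optional"`, `getToolState` cannot distinguish a tool that was never classified from one explicitly marked optional.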
diff --git a/frontend/src/components/app/layout/EvidenceDrawer.tsx b/frontend/src/components/app/layout/EvidenceDrawer.tsx
new file mode 100644
index 0000000..3cca23c
--- /dev/null
+++ b/frontend/src/components/app/layout/EvidenceDrawer.tsx
@@ -0,0 +1,97 @@
+/**
+ * EvidenceDrawer — mobile slide-in drawer for the evidence/references panel.
+ *
+ * Below the `lg` breakpoint the evidence pane is hidden from the main layout
+ * and instead rendered inside this overlay drawer that slides in from the right.
+ * Closes on Escape, on outside click, or via the explicit close button.
+ *
+ * Phase 3 Step 3.2.
+ */
+
+import { useEffect, useRef } from "react";
+import { cn } from "../../../models/utils";
+
+export default function EvidenceDrawer({
+	open,
+	onClose,
+	children,
+}: {
+	open: boolean;
+	onClose: () => void;
+	children: React.ReactNode;
+}) {
+	const drawerRef = useRef<HTMLDivElement>(null);
+
+	// Close on Escape
+	useEffect(() => {
+		if (!open) return;
+		function handleKey(e: KeyboardEvent) {
+			if (e.key === "Escape") onClose();
+		}
+		window.addEventListener("keydown", handleKey);
+		return () => window.removeEventListener("keydown", handleKey);
+	}, [open, onClose]);
+
+	// Close on outside click
+	useEffect(() => {
+		if (!open) return;
+		function handleClick(e: MouseEvent) {
+			if (drawerRef.current && !drawerRef.current.contains(e.target as Node)) {
+				onClose();
+			}
+		}
+		// Delay listener to avoid capturing the toggle click itself
+		const timer = setTimeout(
+			() => window.addEventListener("mousedown", handleClick),
+			0,
+		);
+		return () => {
+			clearTimeout(timer);
+			window.removeEventListener("mousedown", handleClick);
+		};
+	}, [open, onClose]);
+
+	return (
+		<>
+			{/* Backdrop */}
+			<div
+				className={cn(
+					"fixed inset-0 z-40 bg-black/30 transition-opacity duration-200",
+					open
+						? "opacity-100 pointer-events-auto"
+						: "opacity-0 pointer-events-none",
+				)}
+				aria-hidden="true"
+			/>
+
+			{/* Drawer panel */}
+			<div
+				ref={drawerRef}
+				role="dialog"
+				aria-modal={open}
+				aria-label="Evidence panel"
+				className={cn(
+					"fixed inset-y-0 right-0 z-50 w-[85vw] max-w-md transform transition-transform duration-200 ease-out bg-white shadow-xl overflow-y-auto",
+					open ? "translate-x-0" : "translate-x-full",
+				)}
+			>
+				{/* Close button */}
+				<div className="sticky top-0 z-10 flex items-center justify-between border-b bg-white px-3 py-2">
+					<span className="text-sm font-medium text-slate-700">
+						Evidence &amp; References
+					</span>
+					<button
+						type="button"
+						onClick={onClose}
+						className="rounded-lg p-1.5 text-slate-500 hover:bg-slate-100 hover:text-slate-700"
+						aria-label="Close evidence panel"
+					>
+						✕
+					</button>
+				</div>
+
+				<div className="p-3">{open ? children : null}</div>
+			</div>
+		</>
+	);
+}
diff --git a/frontend/src/components/app/layout/SplitPaneLayout.tsx b/frontend/src/components/app/layout/SplitPaneLayout.tsx
new file mode 100644
index 0000000..449c474
--- /dev/null
+++ b/frontend/src/components/app/layout/SplitPaneLayout.tsx
@@ -0,0 +1,69 @@
+/**
+ * SplitPaneLayout — resizable split-pane wrapper for the curate workspace.
+ *
+ * Uses react-resizable-panels to provide a draggable gutter between the
+ * editor (left) and evidence (right) panes. Persists gutter position in
+ * localStorage so it survives across sessions.
+ *
+ * Replaces the prior CSS grid layout (demo.tsx:277-431) per Phase 3 Step 3.1.
+ */
+
+import { useCallback, useState } from "react";
+import type { Layout } from "react-resizable-panels";
+import { Group, Panel, Separator } from "react-resizable-panels";
+
+const STORAGE_KEY = "gtc-split-pane-sizes";
+const MIN_SIZE_PERCENT = 20;
+const LEFT_PANEL_ID = "editor";
+const RIGHT_PANEL_ID = "evidence";
+
+function loadLayout(): Layout | undefined {
+	try {
+		const raw = localStorage.getItem(STORAGE_KEY);
+		if (raw) return JSON.parse(raw) as Layout;
+	} catch {
+		// Ignore corrupt data
+	}
+	return undefined;
+}
+
+export default function SplitPaneLayout({
+	left,
+	right,
+	className,
+}: {
+	left: React.ReactNode;
+	right: React.ReactNode;
+	className?: string;
+}) {
+	const [defaultLayout] = useState(
+		() => loadLayout() ?? { [LEFT_PANEL_ID]: 60, [RIGHT_PANEL_ID]: 40 },
+	);
+
+	const handleLayoutChanged = useCallback((layout: Layout) => {
+		try {
+			localStorage.setItem(STORAGE_KEY, JSON.stringify(layout));
+		} catch {
+			// Storage full or unavailable
+		}
+	}, []);
+
+	return (
+		<Group
+			orientation="horizontal"
+			defaultLayout={defaultLayout}
+			onLayoutChanged={handleLayoutChanged}
+			className={className}
+		>
+			<Panel id={LEFT_PANEL_ID} minSize={MIN_SIZE_PERCENT}>
+				{left}
+			</Panel>
+			<Separator className="group mx-1 flex w-2 items-center justify-center rounded-full transition-colors hover:bg-violet-100 active:bg-violet-200 focus-visible:outline-none focus-visible:ring-2 focus-visible:ring-violet-400">
+				<div className="h-8 w-0.5 rounded-full bg-slate-300 transition-colors group-hover:bg-violet-500 group-active:bg-violet-600" />
+			</Separator>
+			<Panel id={RIGHT_PANEL_ID} minSize={MIN_SIZE_PERCENT}>
+				{right}
+			</Panel>
+		</Group>
+	);
+}
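Review note on `loadLayout` in SplitPaneLayout.tsx: the stored value is cast to `Layout` after `JSON.parse` without checking its shape, so well-formed JSON with the wrong structure would be passed through as the default layout. The sketch below shows one possible stricter parser. `parseStoredLayout` is a hypothetical name, the `Layout` shape (panel id mapped to a percentage) is an assumption inferred from how `defaultLayout` is built in the hunk above, and applying the minimum size at load time is a design suggestion rather than something the diff does.

```typescript
// Assumed shape: panel id -> size in percent.
type Layout = { [panelId: string]: number };

const PANEL_IDS = ["editor", "evidence"];
const MIN_SIZE_PERCENT = 20;

// Hypothetical stricter alternative to the JSON.parse cast in loadLayout.
// Takes the raw stored string so it can be exercised without localStorage.
function parseStoredLayout(raw: string | null): Layout | undefined {
	if (!raw) return undefined;
	let data: unknown;
	try {
		data = JSON.parse(raw);
	} catch {
		return undefined; // corrupt JSON
	}
	if (typeof data !== "object" || data === null || Array.isArray(data)) {
		return undefined; // not a plain object
	}
	const record = data as Record<string, unknown>;
	const layout: Layout = {};
	for (const id of PANEL_IDS) {
		const size = record[id];
		if (typeof size !== "number" || !Number.isFinite(size) || size < MIN_SIZE_PERCENT) {
			return undefined; // missing panel or unusable size
		}
		layout[id] = size;
	}
	return layout;
}

console.log(parseStoredLayout('{"editor":70,"evidence":30}'));
console.log(parseStoredLayout('{"editor":95,"evidence":5}'));
```

In the component this would be called as `parseStoredLayout(localStorage.getItem(STORAGE_KEY))` inside the existing `try` block, with the 60/40 default still applied when it returns `undefined`.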
diff --git a/frontend/src/components/app/pages/CuratePane.tsx b/frontend/src/components/app/pages/CuratePane.tsx
index 325eb89..9f539dd 100644
--- a/frontend/src/components/app/pages/CuratePane.tsx
+++ b/frontend/src/components/app/pages/CuratePane.tsx
@@ -1,42 +1,15 @@
 import { Check, RefreshCw, Save, Trash2 } from "lucide-react";
-import { useEffect, useRef, useState } from "react";
-
-// Determine if it's appropriate to move focus into the editor
-function shouldStealFocus(): boolean {
-	const ae = typeof document !== "undefined" ? document.activeElement : null;
-	if (!ae) return true;
-	// If focus is already on an editable control, don't steal.
-	if (ae instanceof HTMLElement) {
-		const tag = ae.tagName.toLowerCase();
-		if (tag === "input" || tag === "textarea" || ae.isContentEditable) {
-			return false;
-		}
-	}
-	return true;
-}
-
-function moveCaretToEnd(el: HTMLTextAreaElement) {
-	const len = el.value?.length ?? 0;
-	try {
-		el.selectionStart = len;
-		el.selectionEnd = len;
-		// Ensure scroll position shows the end if content overflows
-		el.scrollTop = el.scrollHeight;
-	} catch {
-		// Some environments may not support selection APIs; ignore.
-	}
-}
 
 import useCurationInstructions from "../../../hooks/useCurationInstructions";
-import type { AgentGenerationResult } from "../../../hooks/useGroundTruth";
 import type {
 	ConversationTurn,
 	GroundTruthItem,
-	Reference,
 } from "../../../models/groundTruth";
 import { cn } from "../../../models/utils";
-import { validateConversationPattern } from "../../../models/validators";
-import TagsEditor from "../../app/editor/TagsEditor";
+import {
+	validateConversationPattern,
+	validateExpectedTools,
+} from "../../../models/validators";
 import InstructionsPane from "../../app/InstructionsPane";
 import defaultCurateMd from "../defaultCurateInstructions.md?raw";
 import MultiTurnEditor from "../editor/MultiTurnEditor";
@@ -45,433 +18,202 @@ export default function CuratePane({
 	current,
 	canApprove,
 	saving,
-	onUpdateQuestion,
-	onUpdateAnswer,
 	onUpdateComment,
 	onUpdateTags,
 	onUpdateHistory,
 	onDeleteTurn,
-	onGenerateAgentTurn,
 	onSaveDraft,
 	onApprove,
 	onSkip,
 	onDelete,
 	onRestore,
 	onDuplicate,
-	onUpdateReference,
-	onRemoveReference,
-	onOpenReference,
-	onAddReferences,
-	onEditorModeChange,
 	className,
 }: {
 	current: GroundTruthItem | null | undefined;
 	canApprove: boolean;
 	saving: boolean;
-	onUpdateQuestion: (v: string) => void;
-	onUpdateAnswer: (v: string) => void;
 	onUpdateComment: (v: string) => void;
 	onUpdateTags: (tags: string[]) => void;
 	onUpdateHistory: (history: ConversationTurn[]) => void;
 	onDeleteTurn: (messageIndex: number) => void;
-	onGenerateAgentTurn: (messageIndex: number) => Promise<AgentGenerationResult>;
 	onSaveDraft: () => void;
 	onApprove: () => void;
 	onSkip: () => void;
 	onDelete: () => void;
 	onRestore: () => void;
 	onDuplicate: () => void;
-	onUpdateReference: (refId: string, partial: Partial<Reference>) => void;
-	onRemoveReference: (refId: string) => void;
-	onOpenReference: (ref: Reference) => void;
-	onAddReferences?: (refs: Reference[]) => void;
-	onEditorModeChange?: (mode: "single" | "multi") => void;
 	className?: string;
 }) {
-	// Ref to the Question textarea for autofocus on item load/selection
-	const questionRef = useRef<HTMLTextAreaElement | null>(null);
-
-	// Multi-turn mode: always default to multi-turn mode (single-turn mode is disabled)
-	// editorMode is fixed to "multi" - keeping state for potential future use
-	const [editorMode] = useState<"single" | "multi">("multi");
-
-	// NOTE: Mode toggle disabled - refs and handler removed
-	// const userOverrideRef = useRef(false);
-	// const lastItemIdRef = useRef<string | null>(null);
-
-	// NOTE: Auto-switching logic disabled - always stay in multi-turn mode
-	// Update mode when current item changes based on whether it has history
-	// Only auto-switch if the user hasn't manually overridden the mode for the current item
-	// useEffect(() => {
-	// 	if (!current) return;
-	//
-	// 	// If we switched to a different item, reset the override flag
-	// 	if (current.id !== lastItemIdRef.current) {
-	// 		userOverrideRef.current = false;
-	// 		lastItemIdRef.current = current.id;
-	// 	}
-	//
-	// 	// Don't auto-switch if user manually overrode the mode
-	// 	if (userOverrideRef.current) return;
-	//
-	// 	// If item has history (multi-turn), switch to multi-turn mode
-	// 	if (isMultiTurn(current) && editorMode === "single") {
-	// 		setEditorMode("multi");
-	// 	}
-	// 	// If item lacks history (single-turn), switch to single-turn mode
-	// 	else if (!isMultiTurn(current) && editorMode === "multi") {
-	// 		setEditorMode("single");
-	// 	}
-	// }, [current, editorMode]);
-
-	// Notify parent when editor mode changes
-	useEffect(() => {
-		onEditorModeChange?.(editorMode);
-	}, [editorMode, onEditorModeChange]);
-
-	// NOTE: Mode toggle handler disabled, code kept for potential future use
-	// const handleModeToggle = (mode: "single" | "multi") => {
-	// 	// Mark that the user manually changed the mode
-	// 	userOverrideRef.current = true;
-	//
-	// 	// If switching from single to multi and there's no history yet, initialize with current Q&A
-	// 	if (mode === "multi" && current && !isMultiTurn(current)) {
-	// 		const initialHistory: ConversationTurn[] = [];
-	// 		if (current.question?.trim()) {
-	// 			initialHistory.push({ role: "user", content: current.question.trim() });
-	// 		}
-	// 		if (current.answer?.trim()) {
-	// 			initialHistory.push({ role: "agent", content: current.answer.trim() });
-	// 		}
-	// 		if (initialHistory.length > 0) {
-	// 			onUpdateHistory(initialHistory);
-	//
-	// 			// Assign existing references (those without messageIndex) to the first agent turn (index 1)
-	// 			// This ensures references from single-turn mode are properly associated with the agent turn
-	// 			if (current.references && current.references.length > 0 && onUpdateReference) {
-	// 				const agentMessageIndex = 1; // First agent turn is at index 1 (after user turn at 0)
-	// 				current.references.forEach(ref => {
-	// 					if (ref.messageIndex === undefined) {
-	// 						onUpdateReference(ref.id, { messageIndex: agentMessageIndex });
-	// 					}
-	// 				});
-	// 			}
-	// 		}
-	// 	}
-	//
-	// 	setEditorMode(mode);
-	// 	localStorage.setItem("gtc-editor-mode", mode);
-	// };
-
 	// Dataset-specific curation instructions (fallback to per-item or local default)
 	const datasetName = current?.datasetName;
 	const { markdown: datasetMd, error: dsError } =
 		useCurationInstructions(datasetName);
 
-	// Focus the Question textarea when a current item becomes available or changes
-	useEffect(() => {
-		if (!current?.id) return;
-		const el = questionRef.current;
-		if (!el) return;
-		if (!shouldStealFocus()) return;
-		// Focus and place caret at end for natural appending
-		el.focus();
-		moveCaretToEnd(el);
-		// eslint-disable-next-line react-hooks/exhaustive-deps
-	}, [current?.id]);
-
 	return (
-		<section className={cn("space-y-3 overflow-y-auto min-h-0", className)}>
-			{current?.deleted && (
-				<div className="rounded-2xl border border-rose-200 bg-rose-50 p-3 text-sm text-rose-900">
-					This ground truth is marked as deleted. You can restore it or leave it
-					deleted. It will remain visible in the sidebar and exports.
-				</div>
-			)}
-			<InstructionsPane
-				className=""
-				title="Curation Instructions"
-				markdown={
-					(datasetMd?.trim() ? datasetMd : current?.curationInstructions) ||
-					(defaultCurateMd as unknown as string)
-				}
-			/>
-			{dsError && datasetName && (
-				<div className="-mt-2 text-xs text-amber-700">
-					Using default instructions — couldn't load dataset instructions for
-					<span className="ml-1 font-medium">{datasetName}</span>.
-				</div>
-			)}
-
-			{/* Mode Toggle - DISABLED: Always use multi-turn mode */}
-			{/* <div className="flex items-center gap-2 rounded-2xl border bg-white p-2">
-				<button
-					type="button"
-					onClick={() => handleModeToggle("single")}
-					className={cn(
-						"flex-1 rounded-xl px-4 py-2 text-sm font-medium transition-colors",
-						editorMode === "single"
-							? "bg-violet-600 text-white shadow"
-							: "text-slate-600 hover:bg-slate-100",
-					)}
-				>
-					Single-turn
-				</button>
-				<button
-					type="button"
-					onClick={() => handleModeToggle("multi")}
-					className={cn(
-						"flex-1 rounded-xl px-4 py-2 text-sm font-medium transition-colors",
-						editorMode === "multi"
-							? "bg-violet-600 text-white shadow"
-							: "text-slate-600 hover:bg-slate-100",
-					)}
-				>
-					Multi-turn
-				</button>
-			</div> */}
-
-			{/* Editor: Single-turn or Multi-turn */}
-			{editorMode === "single" ? (
-				<>
-					<div className="rounded-2xl border bg-white p-4 shadow-sm">
-						<div className="mb-3 text-sm font-medium">Question</div>
-						<textarea
-							ref={questionRef}
-							aria-label="Question"
-							className="h-24 w-full resize-y rounded-xl border p-3 focus:outline-none focus:ring-2 focus:ring-violet-300"
-							value={current?.question || ""}
-							onChange={(e) => onUpdateQuestion(e.target.value)}
-						/>
+		<div className={cn("flex min-h-0 gap-3", className)}>
+			{/* Left pane: conversation editor, approval, actions */}
+			<section className="flex-1 min-w-0 space-y-3 overflow-y-auto min-h-0">
+				{current?.deleted && (
+					<div className="rounded-2xl border border-rose-200 bg-rose-50 p-3 text-sm text-rose-900">
+						This ground truth is marked as deleted. You can restore it or leave
+						it deleted. It will remain visible in the sidebar and exports.
 					</div>
-
-					<div className="rounded-2xl border bg-white p-4 shadow-sm">
-						<div className="mb-3 text-sm font-medium">Answer</div>
-						<textarea
-							aria-label="Answer"
-							className="h-48 w-full resize-y rounded-xl border p-3 focus:outline-none focus:ring-2 focus:ring-violet-300"
-							value={current?.answer || ""}
-							onChange={(e) => onUpdateAnswer(e.target.value)}
-						/>
+				)}
+				<InstructionsPane
+					className=""
+					title="Curation Instructions"
+					markdown={
+						(datasetMd?.trim() ? datasetMd : current?.curationInstructions) ||
+						(defaultCurateMd as unknown as string)
+					}
+				/>
+				{dsError && datasetName && (
+					<div className="-mt-2 text-xs text-amber-700">
+						Using default instructions — couldn't load dataset instructions for
+						<span className="ml-1 font-medium">{datasetName}</span>.
 					</div>
-				</>
-			) : (
+				)}
+
+				{/* Multi-turn conversation editor */}
 				<div className="rounded-2xl border bg-white p-4 shadow-sm">
 					<MultiTurnEditor
 						current={current || null}
 						onUpdateHistory={onUpdateHistory}
 						onDeleteTurn={onDeleteTurn}
-						onGenerate={onGenerateAgentTurn}
 						canEdit={!current?.deleted}
-						onUpdateReference={onUpdateReference}
-						onRemoveReference={onRemoveReference}
-						onOpenReference={onOpenReference}
-						onAddReferences={onAddReferences}
 						onUpdateTags={onUpdateTags}
 					/>
 				</div>
-			)}
-
-			{editorMode === "single" && (
-				<TagsEditor
-					selected={current?.manualTags || []}
-					computedTags={current?.computedTags}
-					onChange={onUpdateTags}
-					title="Tags"
-				/>
-			)}
 
-			{/* Comments Panel */}
-			<div className="rounded-2xl border bg-white p-4 shadow-sm">
-				<div className="mb-1 flex items-center gap-2">
-					<div className="text-sm font-medium">Comments</div>
-					<span className="ml-1 rounded-full border px-2 py-0.5 text-xs text-slate-500">
-						Optional
-					</span>
+				{/* Comments Panel */}
+				<div className="rounded-2xl border bg-white p-4 shadow-sm">
+					<div className="mb-1 flex items-center gap-2">
+						<div className="text-sm font-medium">Comments</div>
+						<span className="ml-1 rounded-full border px-2 py-0.5 text-xs text-slate-500">
+							Optional
+						</span>
+					</div>
+					<textarea
+						aria-label="Comments"
+						placeholder="Add curator notes (optional)"
+						className="h-28 w-full resize-y rounded-xl border p-3 focus:outline-none focus:ring-2 focus:ring-violet-300"
+						value={current?.comment || ""}
+						onChange={(e) => onUpdateComment(e.target.value)}
+					/>
 				</div>
-				<textarea
-					aria-label="Comments"
-					placeholder="Add curator notes (optional)"
-					className="h-28 w-full resize-y rounded-xl border p-3 focus:outline-none focus:ring-2 focus:ring-violet-300"
-					value={current?.comment || ""}
-					onChange={(e) => onUpdateComment(e.target.value)}
-				/>
-			</div>
 
-			{/* Approval Requirements Explanation */}
-			{current && !canApprove && (
-				<div className="rounded-2xl border border-amber-200 bg-amber-50 p-4">
-					<h3 className="mb-2 text-sm font-semibold text-amber-900">
-						⚠️ Issues Preventing Approval
-					</h3>
-					<div className="space-y-2 text-sm text-amber-800">
-						{current.deleted && (
-							<p className="flex items-start gap-2">
-								<span className="mt-0.5 flex-shrink-0">✗</span>
-								<span>
-									<strong>Item is deleted:</strong> Restore the item before
-									approving
-								</span>
-							</p>
-						)}
-						{editorMode === "multi" ? (
-							<>
-								{(() => {
-									// Validate conversation pattern
-									const patternValidation = validateConversationPattern(
-										current.history,
-									);
-									if (!patternValidation.valid) {
-										return (
-											<>
-												{patternValidation.errors.map((error) => (
-													<p key={error} className="flex items-start gap-2">
-														<span className="mt-0.5 flex-shrink-0">✗</span>
-														<span>
-															<strong>Conversation pattern error:</strong>{" "}
-															{error}
-														</span>
-													</p>
-												))}
-											</>
-										);
-									}
-									return null;
-								})()}
-								{(() => {
-									// Check that all agent turns have expected behavior
-									const agentTurns = (current.history || []).filter(
-										(turn) => turn.role === "agent",
-									);
-									const agentTurnsWithoutBehavior = agentTurns.filter(
-										(turn) =>
-											!turn.expectedBehavior ||
-											turn.expectedBehavior.length === 0,
-									);
-									if (agentTurnsWithoutBehavior.length > 0) {
-										return (
-											<p className="flex items-start gap-2">
-												<span className="mt-0.5 flex-shrink-0">✗</span>
-												<span>
-													<strong>Missing expected behavior:</strong>{" "}
-													{agentTurnsWithoutBehavior.length} agent turn
-													{agentTurnsWithoutBehavior.length !== 1 ? "s" : ""}{" "}
-													need at least one expected behavior selected
-												</span>
-											</p>
-										);
-									}
-									return null;
-								})()}
-							</>
-						) : (
-							<>
-								{(() => {
-									const hasSelected = current.references.length > 0;
-									if (!hasSelected) {
-										return (
-											<p className="flex items-start gap-2">
-												<span className="mt-0.5 flex-shrink-0">✗</span>
-												<span>
-													<strong>No references:</strong> Select at least one
-													reference
-												</span>
-											</p>
-										);
-									}
-									return null;
-								})()}
-								{(() => {
-									const unvisitedRefs = (current.references || []).filter(
-										(r) => !r.visitedAt,
-									);
-									if (unvisitedRefs.length > 0) {
-										return (
-											<p className="flex items-start gap-2">
+				{/* Approval Requirements Explanation */}
+				{current && !canApprove && (
+					<div className="rounded-2xl border border-amber-200 bg-amber-50 p-4">
+						<h3 className="mb-2 text-sm font-semibold text-amber-900">
+							⚠️ Issues Preventing Approval
+						</h3>
+						<div className="space-y-2 text-sm text-amber-800">
+							{current.deleted && (
+								<p className="flex items-start gap-2">
+									<span className="mt-0.5 flex-shrink-0">✗</span>
+									<span>
+										<strong>Item is deleted:</strong> Restore the item before
+										approving
+									</span>
+								</p>
+							)}
+							{validateConversationPattern(current.history).valid
+								? null
+								: validateConversationPattern(current.history).errors.map(
+										(error) => (
+											<p key={error} className="flex items-start gap-2">
 												<span className="mt-0.5 flex-shrink-0">✗</span>
 												<span>
-													<strong>Unvisited references:</strong>{" "}
-													{unvisitedRefs.length} reference
-													{unvisitedRefs.length !== 1 ? "s" : ""} need to be
-													opened and reviewed
+													<strong>Conversation pattern error:</strong> {error}
 												</span>
 											</p>
-										);
-									}
-									return null;
-								})()}
-							</>
-						)}
+										),
+									)}
+							{/* Expected tools gating: show missing required tools */}
+							{validateExpectedTools(current).errors.map((error) => (
+								<p key={error} className="flex items-start gap-2">
+									<span className="mt-0.5 flex-shrink-0">✗</span>
+									<span>
+										<strong>Expected tool not called:</strong>{" "}
+										{error
+											.replace(/^Required tool /, "")
+											.replace(/ was not called$/, "")}
+									</span>
+								</p>
+							))}
+						</div>
 					</div>
-				</div>
-			)}
-			{current && canApprove && (
-				<div className="rounded-2xl border border-emerald-200 bg-emerald-50 p-4">
-					<h3 className="mb-2 text-sm font-semibold text-emerald-900">
-						✓ Ready for Approval
-					</h3>
-					<p className="text-sm text-emerald-800">
-						All requirements are met. You can approve this item.
-					</p>
-				</div>
-			)}
+				)}
+				{current && canApprove && (
+					<div className="rounded-2xl border border-emerald-200 bg-emerald-50 p-4">
+						<h3 className="mb-2 text-sm font-semibold text-emerald-900">
+							✓ Ready for Approval
+						</h3>
+						<p className="text-sm text-emerald-800">
+							All requirements are met. You can approve this item.
+						</p>
+					</div>
+				)}
 
-			<div className="flex items-center gap-2">
-				<button
-					type="button"
-					onClick={onSaveDraft}
-					disabled={saving}
-					className="inline-flex items-center gap-2 rounded-2xl border bg-white px-4 py-2 hover:bg-violet-50 disabled:opacity-50"
-				>
-					<Save className="h-4 w-4" /> {saving ? "Saving…" : "Save Draft"}
-				</button>
-				<button
-					type="button"
-					onClick={onApprove}
-					disabled={saving || !!current?.deleted || !canApprove}
-					className="inline-flex items-center gap-2 rounded-2xl border border-violet-300 bg-violet-600 px-4 py-2 text-white shadow hover:bg-violet-700 disabled:opacity-50"
-				>
-					<Check className="h-4 w-4" /> {saving ? "Saving…" : "Approve"}
-				</button>
-				<button
-					type="button"
-					onClick={onDuplicate}
-					disabled={!current || saving}
-					className="inline-flex items-center gap-2 rounded-2xl border border-slate-300 bg-white px-4 py-2 text-slate-700 hover:bg-slate-50 disabled:opacity-50"
-					title="Create a new draft rephrasing of this item"
-				>
-					{/* Using RefreshCw icon to indicate 'create a variant'; could be Copy icon if added */}
-					<RefreshCw className="h-4 w-4" /> Duplicate
-				</button>
-				<button
-					type="button"
-					onClick={onSkip}
-					className="inline-flex items-center gap-2 rounded-2xl border bg-white px-4 py-2 hover:bg-violet-50"
-					title="Skip this ground truth (mark skipped) and move to the next"
-				>
-					Skip
-				</button>
-				{current && !current.deleted && (
+				<div className="flex items-center gap-2">
 					<button
 						type="button"
-						onClick={onDelete}
-						className="ml-auto inline-flex items-center gap-2 rounded-2xl border border-rose-300 bg-white px-4 py-2 text-rose-700 hover:bg-rose-50"
-						title="Soft delete this ground truth"
+						onClick={onSaveDraft}
+						disabled={saving}
+						className="inline-flex items-center gap-2 rounded-2xl border bg-white px-4 py-2 hover:bg-violet-50 disabled:opacity-50"
 					>
-						<Trash2 className="h-4 w-4" /> Delete
+						<Save className="h-4 w-4" /> {saving ? "Saving…" : "Save Draft"}
 					</button>
-				)}
-				{current?.deleted && (
 					<button
 						type="button"
-						onClick={onRestore}
-						className="ml-auto inline-flex items-center gap-2 rounded-2xl border border-emerald-300 bg-white px-4 py-2 text-emerald-700 hover:bg-emerald-50"
-						title="Restore this ground truth"
+						onClick={onApprove}
+						disabled={saving || !!current?.deleted || !canApprove}
+						className="inline-flex items-center gap-2 rounded-2xl border border-violet-300 bg-violet-600 px-4 py-2 text-white shadow hover:bg-violet-700 disabled:opacity-50"
 					>
-						<RefreshCw className="h-4 w-4" /> Restore
+						<Check className="h-4 w-4" /> {saving ? "Saving…" : "Approve"}
 					</button>
-				)}
-			</div>
-		</section>
+					<button
+						type="button"
+						onClick={onDuplicate}
+						disabled={!current || saving}
+						className="inline-flex items-center gap-2 rounded-2xl border border-slate-300 bg-white px-4 py-2 text-slate-700 hover:bg-slate-50 disabled:opacity-50"
+						title="Create a new draft rephrasing of this item"
+					>
+						{/* Using RefreshCw icon to indicate 'create a variant'; could be Copy icon if added */}
+						<RefreshCw className="h-4 w-4" /> Duplicate
+					</button>
+					<button
+						type="button"
+						onClick={onSkip}
+						className="inline-flex items-center gap-2 rounded-2xl border bg-white px-4 py-2 hover:bg-violet-50"
+						title="Skip this ground truth (mark skipped) and move to the next"
+					>
+						Skip
+					</button>
+					{current && !current.deleted && (
+						<button
+							type="button"
+							onClick={onDelete}
+							className="ml-auto inline-flex items-center gap-2 rounded-2xl border border-rose-300 bg-white px-4 py-2 text-rose-700 hover:bg-rose-50"
+							title="Soft delete this ground truth"
+						>
+							<Trash2 className="h-4 w-4" /> Delete
+						</button>
+					)}
+					{current?.deleted && (
+						<button
+							type="button"
+							onClick={onRestore}
+							className="ml-auto inline-flex items-center gap-2 rounded-2xl border border-emerald-300 bg-white px-4 py-2 text-emerald-700 hover:bg-emerald-50"
+							title="Restore this ground truth"
+						>
+							<RefreshCw className="h-4 w-4" /> Restore
+						</button>
+					)}
+				</div>
+			</section>
+		</div>
 	);
 }
diff --git a/frontend/src/components/app/pages/QuestionsList.tsx b/frontend/src/components/app/pages/QuestionsList.tsx
index e4bba96..b19f9fb 100644
--- a/frontend/src/components/app/pages/QuestionsList.tsx
+++ b/frontend/src/components/app/pages/QuestionsList.tsx
@@ -1,4 +1,7 @@
-import type { GroundTruthItem } from "../../../models/groundTruth";
+import {
+	type GroundTruthItem,
+	getQueuePreview,
+} from "../../../models/groundTruth";
 import { cn } from "../../../models/utils";
 
 export default function QuestionsList({
@@ -43,9 +46,9 @@ export default function QuestionsList({
 								</div>
 								<div
 									className="text-sm font-medium truncate"
-									title={it.question}
+									title={getQueuePreview(it)}
 								>
-									{it.question || "(no question)"}
+									{getQueuePreview(it)}
 								</div>
 							</div>
 							<div className="flex items-center gap-2 flex-none">
diff --git a/frontend/src/components/app/pages/ReferencesSection.tsx b/frontend/src/components/app/pages/ReferencesSection.tsx
index e139b67..31eaf4f 100644
--- a/frontend/src/components/app/pages/ReferencesSection.tsx
+++ b/frontend/src/components/app/pages/ReferencesSection.tsx
@@ -1,9 +1,35 @@
+/**
+ * ReferencesSection — generic right-pane container.
+ *
+ * Phase 4 redesign: this component is now a generic right-pane host rather
+ * than a purely retrieval-specific panel.  It renders:
+ *
+ *  1. Evidence & Trace panel (TracePanel) — always shown when the current item
+ *     has generic agentic data (toolCalls, traceIds, metadata, feedback,
+ *     expectedTools).  This is the primary Phase 4 evidence surface.
+ *
+ *  2. RAG compatibility panel (ReferencesTabs): shown only in single-turn
+ *     mode, i.e. when `isMultiTurn` is falsy.  This surface keeps
+ *     retrieval-specific review alive without it defining the host layout.
+ *
+ * Passing `item` is optional; when omitted only the ReferencesTabs section
+ * can render (backward-compatible with the existing single-turn surface).
+ */
+
 import { useRef, useState } from "react";
-import type { Reference } from "../../../models/groundTruth";
-import { urlToTitle } from "../../../models/utils";
+import type {
+	ContextEntry,
+	ExpectedTools,
+	GroundTruthItem,
+	Reference,
+} from "../../../models/groundTruth";
+import { hasEvidenceData } from "../../../models/groundTruth";
+import { cn, urlToTitle } from "../../../models/utils";
 import ReferencesTabs from "../../app/ReferencesPanel/ReferencesTabs";
+import TracePanel from "../../app/TracePanel";
 
 export default function ReferencesSection({
+	item,
 	query,
 	setQuery,
 	searching,
@@ -15,7 +41,12 @@ export default function ReferencesSection({
 	onRemoveReference,
 	onOpenReference,
 	isMultiTurn,
+	onUpdateContextEntries,
+	onUpdateExpectedTools,
 }: {
+	/** Optional: current ground truth item.  When present the evidence panel
+	 *  is rendered at the top of the right pane. */
+	item?: GroundTruthItem | null;
 	query: string;
 	setQuery: (q: string) => void;
 	searching: boolean;
@@ -27,11 +58,20 @@ export default function ReferencesSection({
 	onRemoveReference: (id: string) => void;
 	onOpenReference: (ref: Reference) => void;
 	isMultiTurn?: boolean;
+	onUpdateContextEntries?: (entries: ContextEntry[]) => void;
+	onUpdateExpectedTools?: (tools: ExpectedTools) => void;
 }) {
 	const [rightTab, setRightTab] = useState<"search" | "selected">("search");
 	const [searchSelected, setSearchSelected] = useState<Set<string>>(new Set());
 	const searchInputRef = useRef<HTMLInputElement | null>(null);
 
+	// RAG compat surface: only show ReferencesTabs in single-turn mode.
+	// Multi-turn items manage references per-turn via the conversation editor.
+	const showRagCompat = !isMultiTurn;
+
+	// Evidence panel: show TracePanel when item has generic agentic data.
+	const showEvidence = !!item && hasEvidenceData(item);
+
 	async function runSearch() {
 		try {
 			await onRunSearch();
@@ -72,33 +112,82 @@ export default function ReferencesSection({
 		searchInputRef.current?.focus();
 	}
 
+	// Nothing to show
+	if (!showEvidence && !showRagCompat) {
+		return (
+			<aside className="self-start h-[calc(100vh-5.5rem)] rounded-2xl border bg-slate-50 p-4 flex items-center justify-center text-sm text-slate-400 shadow-sm">
+				No evidence or references available.
+			</aside>
+		);
+	}
+
 	return (
-		<ReferencesTabs
-			rightTab={rightTab}
-			setRightTab={setRightTab}
-			query={query}
-			setQuery={setQuery}
-			searching={searching}
-			searchResults={searchResults}
-			searchSelected={searchSelected}
-			onRunSearch={runSearch}
-			onToggleSearchSelect={toggleSelectSearchResult}
-			onAddSelectedFromResults={addSelectedFromResults}
-			onAddSingleResult={(ref) => addRefsFromResults([ref])}
-			searchInputRef={searchInputRef}
-			references={references}
-			onUpdateReference={onUpdateReference}
-			onRemoveReference={(refId) => {
-				const r = (references || []).find((x) => x.id === refId);
-				const name = r?.title || (r ? urlToTitle(r.url) : "reference");
-				if (
-					window.confirm(`Remove reference "${name}"? You can Undo for 8s.`)
-				) {
-					onRemoveReference(refId);
-				}
-			}}
-			onOpenReference={onOpenReference}
-			isMultiTurn={isMultiTurn}
-		/>
+		<aside
+			className={cn(
+				"self-start flex flex-col overflow-hidden",
+				// When showing evidence, let TracePanel provide its own container styling.
+				// Only add border/bg when RAG panel is the primary surface.
+				showEvidence
+					? "max-h-[calc(100vh-5.5rem)]"
+					: "rounded-2xl border bg-white shadow-sm h-[calc(100vh-5.5rem)]",
+			)}
+		>
+			{/* Evidence & Trace panel (generic agentic data) */}
+			{showEvidence && item && (
+				<div
+					className={cn(
+						"overflow-y-auto",
+						showRagCompat ? "flex-none border-b max-h-[50%]" : "flex-1",
+					)}
+				>
+					<TracePanel
+						item={item}
+						onUpdateContextEntries={onUpdateContextEntries}
+						onUpdateExpectedTools={onUpdateExpectedTools}
+						onAddReferences={onAddRefs}
+						onOpenReference={onOpenReference}
+						onUpdateReference={onUpdateReference}
+						onRemoveReference={onRemoveReference}
+					/>
+				</div>
+			)}
+
+			{/* RAG references panel — retrieval search and selected references */}
+			{showRagCompat && (
+				<div className="flex flex-col flex-1 min-h-0">
+					<div className="flex-1 min-h-0 overflow-hidden">
+						<ReferencesTabs
+							rightTab={rightTab}
+							setRightTab={setRightTab}
+							query={query}
+							setQuery={setQuery}
+							searching={searching}
+							searchResults={searchResults}
+							searchSelected={searchSelected}
+							onRunSearch={runSearch}
+							onToggleSearchSelect={toggleSelectSearchResult}
+							onAddSelectedFromResults={addSelectedFromResults}
+							onAddSingleResult={(ref) => addRefsFromResults([ref])}
+							searchInputRef={searchInputRef}
+							references={references}
+							onUpdateReference={onUpdateReference}
+							onRemoveReference={(refId) => {
+								const r = (references || []).find((x) => x.id === refId);
+								const name = r?.title || (r ? urlToTitle(r.url) : "reference");
+								if (
+									window.confirm(
+										`Remove reference "${name}"? You can Undo for 8s.`,
+									)
+								) {
+									onRemoveReference(refId);
+								}
+							}}
+							onOpenReference={onOpenReference}
+							isMultiTurn={isMultiTurn}
+						/>
+					</div>
+				</div>
+			)}
+		</aside>
 	);
 }
diff --git a/frontend/src/components/app/pages/StatsPage.tsx b/frontend/src/components/app/pages/StatsPage.tsx
index f2fafa8..621b43c 100644
--- a/frontend/src/components/app/pages/StatsPage.tsx
+++ b/frontend/src/components/app/pages/StatsPage.tsx
@@ -1,4 +1,5 @@
 import { useEffect, useState } from "react";
+import { shouldUseDemoProvider } from "../../../config/demo";
 import type { GroundTruthItem } from "../../../models/groundTruth";
 import {
 	getGroundTruthStats,
@@ -20,7 +21,7 @@ export default function StatsPage({
 	useEffect(() => {
 		let cancelled = false;
 		(async () => {
-			if (demoMode) {
+			if (demoMode && shouldUseDemoProvider()) {
 				const data = await mockGetGroundTruthStats();
 				if (!cancelled) setStats(data);
 			} else {
diff --git a/frontend/src/components/modals/InspectItemModal.tsx b/frontend/src/components/modals/InspectItemModal.tsx
index 680b029..b120863 100644
--- a/frontend/src/components/modals/InspectItemModal.tsx
+++ b/frontend/src/components/modals/InspectItemModal.tsx
@@ -4,8 +4,6 @@ import { useGroundTruthCache } from "../../hooks/useGroundTruthCache";
 import useModalKeys from "../../hooks/useModalKeys";
 import type { GroundTruthItem } from "../../models/groundTruth";
 import { getGroundTruth } from "../../services/groundTruths";
-import { getRuntimeConfig } from "../../services/runtimeConfig";
-import { validateReferenceUrl } from "../../utils/urlValidation";
 import MultiTurnEditor from "../app/editor/MultiTurnEditor";
 import type { QuestionsExplorerItem } from "../app/QuestionsExplorer";
 import TagChip from "../common/TagChip";
@@ -22,9 +20,6 @@ export default function InspectItemModal({ isOpen, item, onClose }: Props) {
 	);
 	const [isLoading, setIsLoading] = useState(false);
 	const [loadError, setLoadError] = useState<string | null>(null);
-	const [trustedReferenceDomains, setTrustedReferenceDomains] = useState<
-		string[]
-	>([]);
 
 	const cache = useGroundTruthCache();
 
@@ -40,36 +35,35 @@ export default function InspectItemModal({ isOpen, item, onClose }: Props) {
 		if (!isOpen || !item) {
 			setCompleteItem(null);
 			setLoadError(null);
+			setIsLoading(false);
 			return;
 		}
 
-		// Load trusted domains for reference opening
-		getRuntimeConfig()
-			.then((cfg) => {
-				setTrustedReferenceDomains(cfg.trustedReferenceDomains ?? []);
-			})
-			.catch(() => {
-				setTrustedReferenceDomains([]);
-			});
-
 		// Validate required fields before proceeding
 		if (!item.datasetName || !item.bucket || !item.id) {
 			setLoadError("Missing required item identifiers");
 			setCompleteItem(item);
+			setIsLoading(false);
 			return;
 		}
 
+		const { datasetName, bucket, id } = item;
+
 		// Check cache first (FR-001: in-memory session cache)
-		const cachedItem = cache.get(item.datasetName, item.bucket, item.id);
+		const cachedItem = cache.get(datasetName, bucket, id);
 
 		if (cachedItem) {
 			// Use cached data to avoid redundant network call
 			setCompleteItem(cachedItem);
+			setLoadError(null);
+			setIsLoading(false);
 			return;
 		}
 
 		// Fetch fresh data if not in cache
 		// (List endpoint returns truncated data for performance, but individual endpoint has complete history)
+		const controller = new AbortController();
+		setCompleteItem(null);
 		setIsLoading(true);
 		setLoadError(null);
 
@@ -77,11 +71,14 @@ export default function InspectItemModal({ isOpen, item, onClose }: Props) {
 			try {
 				// Fetch complete item data from individual endpoint
 				const completeItemData = await getGroundTruth(
-					item.datasetName || "",
-					item.bucket || "",
-					item.id,
+					datasetName,
+					bucket,
+					id,
+					controller.signal,
 				);
 
+				if (controller.signal.aborted) return;
+
 				if (!completeItemData) {
 					setLoadError("Item not found");
 					setCompleteItem(item); // Fallback to original
@@ -89,15 +86,11 @@ export default function InspectItemModal({ isOpen, item, onClose }: Props) {
 				}
 
 				// Store in cache for future inspections (FR-001)
-				cache.set(
-					item.datasetName || "",
-					item.bucket || "",
-					item.id,
-					completeItemData,
-				);
+				cache.set(datasetName, bucket, id, completeItemData);
 
 				setCompleteItem(completeItemData);
 			} catch (error) {
+				if (controller.signal.aborted) return;
 				const message =
 					error instanceof Error
 						? error.message
@@ -107,8 +100,14 @@ export default function InspectItemModal({ isOpen, item, onClose }: Props) {
 				setCompleteItem(item);
 			}
 		})().finally(() => {
-			setIsLoading(false);
+			if (!controller.signal.aborted) {
+				setIsLoading(false);
+			}
 		});
+
+		return () => {
+			controller.abort();
+		};
 	}, [isOpen, item, cache]);
 
 	if (!isOpen || !item) return null;
@@ -142,7 +141,7 @@ export default function InspectItemModal({ isOpen, item, onClose }: Props) {
 									ID
 								</div>
 								<div className="rounded-lg border bg-slate-50 p-2 font-mono text-sm text-slate-800">
-									{item.id}
+									{displayItem.id}
 								</div>
 							</div>
 							<div>
@@ -152,18 +151,18 @@ export default function InspectItemModal({ isOpen, item, onClose }: Props) {
 								<div className="flex gap-2">
 									<span
 										className={`inline-block rounded-full px-3 py-1 text-sm font-medium ${
-											item.status === "draft"
+											displayItem.status === "draft"
 												? "bg-amber-100 text-amber-900"
-												: item.status === "approved"
+												: displayItem.status === "approved"
 													? "bg-emerald-100 text-emerald-900"
-													: item.status === "skipped"
+													: displayItem.status === "skipped"
 														? "bg-slate-200 text-slate-800"
 														: "bg-rose-100 text-rose-900"
 										}`}
 									>
-										{item.status}
+										{displayItem.status}
 									</span>
-									{item.deleted && item.status !== "deleted" && (
+									{displayItem.deleted && displayItem.status !== "deleted" && (
 										<span className="inline-block rounded-full bg-rose-100 px-3 py-1 text-sm font-medium text-rose-900">
 											deleted
 										</span>
@@ -172,25 +171,25 @@ export default function InspectItemModal({ isOpen, item, onClose }: Props) {
 							</div>
 						</div>
 						{/* Dataset and Bucket */}
-						{(item.datasetName || item.bucket) && (
+						{(displayItem.datasetName || displayItem.bucket) && (
 							<div className="grid grid-cols-2 gap-4">
-								{item.datasetName && (
+								{displayItem.datasetName && (
 									<div>
 										<div className="mb-1 text-xs font-medium text-slate-600">
 											Dataset
 										</div>
 										<div className="rounded-lg border bg-slate-50 p-2 text-sm text-slate-800">
-											{item.datasetName}
+											{displayItem.datasetName}
 										</div>
 									</div>
 								)}
-								{item.bucket && (
+								{displayItem.bucket && (
 									<div>
 										<div className="mb-1 text-xs font-medium text-slate-600">
 											Bucket
 										</div>
 										<div className="rounded-lg border bg-slate-50 p-2 text-sm text-slate-800">
-											{item.bucket}
+											{displayItem.bucket}
 										</div>
 									</div>
 								)}
@@ -201,6 +200,14 @@ export default function InspectItemModal({ isOpen, item, onClose }: Props) {
 							<div className="mb-1 text-xs font-medium text-slate-600">
 								Conversation
 							</div>
+							{!isLoading && loadError && (
+								<div
+									role="alert"
+									className="mb-3 rounded-lg border border-amber-200 bg-amber-50 p-3 text-sm text-amber-800"
+								>
+									{loadError}
+								</div>
+							)}
 							<div className="rounded-lg border bg-white min-h-[200px]">
 								{isLoading ? (
 									<div className="flex items-center justify-center p-8">
@@ -208,10 +215,6 @@ export default function InspectItemModal({ isOpen, item, onClose }: Props) {
 											Loading complete conversation...
 										</div>
 									</div>
-								) : loadError ? (
-									<div className="flex items-center justify-center p-8">
-										<div className="text-sm text-red-600">{loadError}</div>
-									</div>
 								) : (
 									<MultiTurnEditor
 										current={displayItem}
@@ -219,85 +222,48 @@ export default function InspectItemModal({ isOpen, item, onClose }: Props) {
 										canEdit={false}
 										onUpdateHistory={() => {}}
 										onDeleteTurn={() => {}}
-										onGenerate={() =>
-											Promise.resolve({ ok: false, error: "Read-only mode" })
-										}
-										onUpdateReference={() => {}}
-										onRemoveReference={() => {}}
-										// Secure reference opening with validation and user confirmation
-										onOpenReference={(ref) => {
-											if (!validateReferenceUrl(ref.url)) {
-												alert(
-													"This reference contains an unsafe URL and cannot be opened.",
-												);
-												return;
-											}
-
-											// For external or untrusted URLs, show user confirmation
-											const parsedUrl = new URL(ref.url);
-											const hostname = parsedUrl.hostname.toLowerCase();
-											const sameOrigin = hostname === window.location.hostname;
-											const isTrusted =
-												trustedReferenceDomains.includes(hostname);
-											const isExternal = !sameOrigin && !isTrusted;
-
-											if (isExternal) {
-												const confirmed = confirm(
-													`You are about to visit an external website:\n\n${parsedUrl.hostname}\n\nDo you want to continue?`,
-												);
-												if (!confirmed) {
-													return;
-												}
-											}
-
-											// Open with security attributes
-											window.open(
-												ref.url,
-												"_blank",
-												"noopener,noreferrer,nofollow",
-											);
-										}}
 										onUpdateTags={() => {}}
 									/>
 								)}
 							</div>
 						</div>{" "}
 						{/* Tags */}
-						{((item.manualTags && item.manualTags.length > 0) ||
-							(item.computedTags && item.computedTags.length > 0)) && (
+						{((displayItem.manualTags && displayItem.manualTags.length > 0) ||
+							(displayItem.computedTags &&
+								displayItem.computedTags.length > 0)) && (
 							<div>
 								<div className="mb-1 text-xs font-medium text-slate-600">
 									Tags
 								</div>
 								<div className="flex flex-wrap gap-2">
-									{item.computedTags?.map((tag) => (
+									{displayItem.computedTags?.map((tag) => (
 										<TagChip key={`computed-${tag}`} tag={tag} isComputed />
 									))}
-									{item.manualTags?.map((tag) => (
+									{displayItem.manualTags?.map((tag) => (
 										<TagChip key={`manual-${tag}`} tag={tag} />
 									))}
 								</div>
 							</div>
 						)}
 						{/* Comment */}
-						{item.comment && (
+						{displayItem.comment && (
 							<div>
 								<div className="mb-1 text-xs font-medium text-slate-600">
 									Comment
 								</div>
 								<div className="rounded-lg border bg-white p-3 text-sm text-slate-700 whitespace-pre-wrap">
-									{item.comment}
+									{displayItem.comment}
 								</div>
 							</div>
 						)}
 						{/* Metadata */}
-						{item.reviewedAt && (
+						{displayItem.reviewedAt && (
 							<div>
 								<div className="mb-1 text-xs font-medium text-slate-600">
 									Reviewed At
 								</div>
 								<div className="rounded-lg border bg-slate-50 p-2 text-sm text-slate-800">
-									{new Date(item.reviewedAt).toLocaleString()}
+									{new Date(displayItem.reviewedAt).toLocaleString()}
 								</div>
 							</div>
 						)}
diff --git a/frontend/src/config/demo.ts b/frontend/src/config/demo.ts
index 44c5170..e3c0bda 100644
--- a/frontend/src/config/demo.ts
+++ b/frontend/src/config/demo.ts
@@ -1,5 +1,6 @@
 /** Demo mode configuration helper.
- * Reads DEMO_MODE (or VITE_DEMO_MODE as fallback) and exposes a boolean flag.
+ * Reads DEMO_MODE (or VITE_DEMO_MODE as fallback) and exposes demo affordances.
+ * Explicit `json` (or `local`/`static`) keeps the frontend-only provider; any other enabled value defaults to API-backed demo mode.
  */
 
 function normalize(v: unknown): string {
@@ -17,19 +18,30 @@ const RAW_VITE_DEMO_MODE = import.meta.env.VITE_DEMO_MODE as unknown as
 	| string
 	| undefined;
 const DEMO_MODE_VALUE: string = RAW_DEMO_MODE ?? RAW_VITE_DEMO_MODE ?? "";
+const NORMALIZED_DEMO_MODE = normalize(DEMO_MODE_VALUE);
 
-const DEMO_MODE: boolean = ["1", "true", "yes", "on"].includes(
-	normalize(DEMO_MODE_VALUE),
-);
+const DEMO_MODE: boolean =
+	NORMALIZED_DEMO_MODE.length > 0 &&
+	!["0", "false", "no", "off"].includes(NORMALIZED_DEMO_MODE);
+
+export type DemoDataSource = "api" | "json";
+
+export function getDemoDataSource(): DemoDataSource | null {
+	if (!DEMO_MODE) return null;
+	return ["json", "local", "static"].includes(NORMALIZED_DEMO_MODE)
+		? "json"
+		: "api";
+}
 
 /**
  * Determines if the demo provider should be used.
- * Demo mode is only active in development builds when DEMO_MODE is enabled.
+ * Frontend-only demo data is only active in development builds when demo mode
+ * explicitly requests the JSON provider.
  * This function is extracted to enable testing of the gating logic.
  */
 export function shouldUseDemoProvider(): boolean {
 	const inDevBuild = !!import.meta.env.DEV;
-	return inDevBuild && DEMO_MODE;
+	return inDevBuild && getDemoDataSource() === "json";
 }
 
 /**
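Reviewer note: the demo-mode gating in `frontend/src/config/demo.ts` can be sketched as one pure function. This is a minimal sketch, not the module's code. `resolveDemoDataSource` is a hypothetical name, and the trim-and-lowercase step is an assumption, because the body of `normalize` is not shown in this diff.

```typescript
type DemoDataSource = "api" | "json";

// Values taken from the diff: these disable demo mode entirely.
const DISABLED_VALUES = ["0", "false", "no", "off"];
// Values taken from the diff: these select the frontend-only JSON provider.
const JSON_VALUES = ["json", "local", "static"];

// Hypothetical helper. The real module reads import.meta.env and calls its own
// normalize(); trimming and lowercasing here is an assumption.
function resolveDemoDataSource(raw: string | undefined): DemoDataSource | null {
	const value = (raw ?? "").trim().toLowerCase();
	// Empty or explicitly disabled: demo mode is off.
	if (value.length === 0 || DISABLED_VALUES.includes(value)) return null;
	// Any other non-empty value enables demo mode; only the JSON aliases stay local.
	return JSON_VALUES.includes(value) ? "json" : "api";
}
```

Under this reading, a value such as `true` now selects the API-backed demo, which is why `shouldUseDemoProvider()` only returns true for the JSON aliases in development builds.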
diff --git a/frontend/src/demo.tsx b/frontend/src/demo.tsx
index cbb1c9f..8c2917b 100644
--- a/frontend/src/demo.tsx
+++ b/frontend/src/demo.tsx
@@ -1,22 +1,22 @@
-import { useEffect, useState } from "react";
+import { lazy, Suspense, useCallback, useEffect, useState } from "react";
 import AppHeader from "./components/app/AppHeader";
 import InstructionsPane from "./components/app/InstructionsPane";
+import EvidenceDrawer from "./components/app/layout/EvidenceDrawer";
+import SplitPaneLayout from "./components/app/layout/SplitPaneLayout";
 import CuratePane from "./components/app/pages/CuratePane";
 import ReferencesSection from "./components/app/pages/ReferencesSection";
-import StatsPage from "./components/app/pages/StatsPage";
 import type { QuestionsExplorerItem } from "./components/app/QuestionsExplorer";
-import QuestionsExplorer from "./components/app/QuestionsExplorer";
 import QueueSidebar from "./components/app/QueueSidebar";
 import Toasts from "./components/common/Toasts";
-import InspectItemModal from "./components/modals/InspectItemModal";
-import TagGlossaryModal from "./components/modals/TagGlossaryModal";
 import DEMO_MODE from "./config/demo";
 import { runSelfTests } from "./dev/self-tests";
 import useGlobalHotkeys from "./hooks/useGlobalHotkeys";
 import useGroundTruth from "./hooks/useGroundTruth";
+import { invalidateGroundTruthCache } from "./hooks/useGroundTruthCache";
 import { useToasts } from "./hooks/useToasts";
 import type { Reference } from "./models/groundTruth";
-import { cn, normalizeUrl } from "./models/utils";
+import { getItemReferences } from "./models/groundTruth";
+import { normalizeUrl } from "./models/utils";
 import {
 	assignItem,
 	requestAssignmentsSelfServe,
@@ -30,6 +30,105 @@ import { getCachedConfig, getRuntimeConfig } from "./services/runtimeConfig";
 import { fetchTagSchema } from "./services/tags";
 import { validateReferenceUrl } from "./utils/urlValidation";
 
+const DESKTOP_CURATE_QUERY = "(min-width: 1024px)";
+
+const StatsPage = lazy(() => import("./components/app/pages/StatsPage"));
+const QuestionsExplorer = lazy(
+	() => import("./components/app/QuestionsExplorer"),
+);
+const InspectItemModal = lazy(
+	() => import("./components/modals/InspectItemModal"),
+);
+const TagGlossaryModal = lazy(
+	() => import("./components/modals/TagGlossaryModal"),
+);
+
+function PageFallback({ label }: { label: string }) {
+	return (
+		<section className="flex flex-1 items-center justify-center rounded-2xl border bg-white p-4 text-sm text-slate-600 shadow-sm min-h-0 min-w-0">
+			{label}
+		</section>
+	);
+}
+
+function PanelFallback({ label }: { label: string }) {
+	return (
+		<div className="flex flex-1 items-center justify-center rounded-2xl border bg-white p-4 text-sm text-slate-600 shadow-sm min-h-0 min-w-0">
+			{label}
+		</div>
+	);
+}
+
+function ModalFallback({ label }: { label: string }) {
+	return (
+		<div className="fixed inset-0 z-50 grid place-items-center bg-black/40 p-4">
+			<div className="w-full max-w-md rounded-2xl border bg-white p-6 text-sm text-slate-600 shadow-xl">
+				{label}
+			</div>
+		</div>
+	);
+}
+
+function getMediaQueryMatch(query: string) {
+	if (typeof window === "undefined") {
+		return false;
+	}
+
+	return window.matchMedia(query).matches;
+}
+
+function useMediaQuery(query: string) {
+	const [matches, setMatches] = useState(() => getMediaQueryMatch(query));
+
+	useEffect(() => {
+		if (typeof window === "undefined") {
+			return;
+		}
+
+		const mediaQueryList = window.matchMedia(query);
+		const handleChange = () => {
+			setMatches(mediaQueryList.matches);
+		};
+
+		handleChange();
+		mediaQueryList.addEventListener("change", handleChange);
+
+		return () => {
+			mediaQueryList.removeEventListener("change", handleChange);
+		};
+	}, [query]);
+
+	return matches;
+}
+
+export function invalidateInspectCacheForExplorerItem(
+	item: Pick<QuestionsExplorerItem, "datasetName" | "bucket" | "id">,
+) {
+	if (item.datasetName && item.bucket && item.id) {
+		invalidateGroundTruthCache(item.datasetName, item.bucket, item.id);
+	}
+}
+
+export async function resolveExplorerAssignSelection(
+	itemId: string,
+	selectItem: (itemId: string) => Promise<boolean>,
+) {
+	const selected = await selectItem(itemId);
+	if (!selected) {
+		return {
+			switchToCurate: false,
+			toastKind: "info" as const,
+			toastMessage: `Assigned ${itemId}, but opening it in curate was cancelled or failed.`,
+		};
+	}
+
+	return {
+		switchToCurate: true,
+		toastKind: "success" as const,
+		toastMessage: `Assigned ${itemId} for curation`,
+	};
+}
+
 export default function GTAppDemo() {
 	const [sidebarOpen, setSidebarOpen] = useState<boolean>(true);
 	const [inspectItem, setInspectItem] = useState<QuestionsExplorerItem | null>(
@@ -41,15 +140,17 @@ export default function GTAppDemo() {
 		"curate",
 	);
 	const [selfServeBusy, setSelfServeBusy] = useState(false);
-	// Track the current editor mode (single-turn or multi-turn)
-	const [editorMode, setEditorMode] = useState<"single" | "multi">("single");
+	const [drawerOpen, setDrawerOpen] = useState(false);
+	const closeDrawer = useCallback(() => setDrawerOpen(false), []);
+	const isDesktop = useMediaQuery(DESKTOP_CURATE_QUERY);
 
 	// Feature hook
 	const gt = useGroundTruth();
 
-	// Initialize runtime config on app startup
+	// Warm the runtime-config store on app startup. The service load is cached and
+	// idempotent, so this stays safe when StrictMode runs the effect twice in development.
 	useEffect(() => {
-		getRuntimeConfig().catch((err) => {
+		void getRuntimeConfig().catch((err) => {
 			console.warn("Failed to load runtime config:", err);
 		});
 	}, []);
@@ -92,14 +193,6 @@ export default function GTAppDemo() {
 		}
 	}
 
-	async function onGenerateAgentTurn(messageIndex: number) {
-		const result = await gt.generateAgentTurn(messageIndex);
-		if (!result.ok) {
-			toast("error", result.error);
-		}
-		return result;
-	}
-
 	async function onSave(nextStatus?: "draft" | "approved") {
 		const res = await gt.save(nextStatus);
 		if (!res.ok) {
@@ -149,6 +242,92 @@ export default function GTAppDemo() {
 		}
 	}
 
+	useEffect(() => {
+		if (isDesktop) {
+			closeDrawer();
+		}
+	}, [isDesktop, closeDrawer]);
+
+	const isMultiTurn = Boolean(gt.current?.history?.length);
+	const references = gt.current ? getItemReferences(gt.current) : [];
+
+	async function onDuplicate() {
+		const res = await gt.duplicateCurrent();
+		if (res.ok) {
+			toast("success", `Created rephrase ${res.created.id} and opened it.`);
+		} else {
+			toast("error", res.error || "Duplicate failed");
+		}
+	}
+
+	async function onSkip() {
+		if (!gt.current) return;
+		const res = await gt.save("skipped");
+		if (!res.ok) return;
+
+		const index = gt.items.findIndex((item) => item.id === res.saved.id);
+		const nextItem =
+			index >= 0 && index < gt.items.length - 1
+				? gt.items[index + 1]
+				: gt.items[0];
+		if (nextItem) {
+			void gt.selectItem(nextItem.id, { force: true });
+		}
+	}
+
+	function onAddRefs(refs: Reference[]) {
+		gt.addReferences(refs);
+		toast("success", `Added ${refs.length} reference(s)`);
+	}
+
+	function onRemoveReference(refId: string) {
+		gt.removeReferenceWithUndo(refId, (undo, timeoutMs) => {
+			toast("info", "Reference removed.", {
+				duration: timeoutMs,
+				actionLabel: "Undo",
+				onAction: undo,
+			});
+		});
+	}
+
+	const curatePane = (
+		<CuratePane
+			className={isDesktop ? "h-full overflow-y-auto" : "min-h-0"}
+			current={gt.current}
+			canApprove={gt.canApprove}
+			saving={gt.saving}
+			onUpdateComment={(v) => gt.updateComment(v)}
+			onUpdateTags={(tags) => gt.updateTags(tags)}
+			onUpdateHistory={(history) => gt.updateHistory(history)}
+			onDeleteTurn={(messageIndex) => gt.deleteTurn(messageIndex)}
+			onSaveDraft={() => onSave("draft")}
+			onApprove={() => onSave("approved")}
+			onDuplicate={onDuplicate}
+			onSkip={onSkip}
+			onDelete={() => toggleDeletedFlag(true)}
+			onRestore={() => toggleDeletedFlag(false)}
+		/>
+	);
+
+	const referencesPane = (
+		<ReferencesSection
+			item={gt.current}
+			query={gt.query}
+			setQuery={gt.setQuery}
+			searching={gt.searching}
+			searchResults={gt.searchResults}
+			onRunSearch={gt.runSearch}
+			onAddRefs={onAddRefs}
+			references={references}
+			onUpdateReference={(id, partial) => gt.updateReference(id, partial)}
+			onRemoveReference={onRemoveReference}
+			onOpenReference={onOpenRef}
+			isMultiTurn={isMultiTurn}
+			onUpdateContextEntries={gt.updateContextEntries}
+			onUpdateExpectedTools={gt.updateExpectedTools}
+		/>
+	);
+
 	return (
 		<div className="flex h-screen w-screen flex-col overflow-hidden bg-gradient-to-b from-violet-50 via-white to-white text-slate-900">
 			{/* Top accent bar */}
@@ -175,11 +354,13 @@ export default function GTAppDemo() {
 
 			<main className="mx-auto flex w-full max-w-none flex-1 flex-col gap-4 p-4 min-h-0">
 				{viewMode === "stats" && (
-					<StatsPage
-						demoMode={DEMO_MODE}
-						items={gt.items}
-						onBack={() => setViewMode("curate")}
-					/>
+					<Suspense fallback={<PageFallback label="Loading stats…" />}>
+						<StatsPage
+							demoMode={DEMO_MODE}
+							items={gt.items}
+							onBack={() => setViewMode("curate")}
+						/>
+					</Suspense>
 				)}
 
 				{viewMode === "questions" && (
@@ -190,98 +371,112 @@ export default function GTAppDemo() {
 							markdown={`\n### Reviewing Questions\n\n- Scan for duplicates, out-of-scope, or low-quality questions.\n- Use Delete to soft-delete; you can restore later.\n- Open an item to curate details.\n`}
 						/>
 						<div className="flex flex-1 min-h-0 min-w-0">
-							<QuestionsExplorer
-								onAssign={async (item) => {
-									try {
-										// Validate the item has required information
-										if (!item.datasetName || !item.bucket) {
-											toast(
-												"error",
-												"Item missing dataset or bucket information",
-											);
-											return;
-										}
+							<Suspense fallback={<PanelFallback label="Loading questions…" />}>
+								<QuestionsExplorer
+									onAssign={async (item) => {
+										try {
+											// Validate the item has required information
+											if (!item.datasetName || !item.bucket) {
+												toast(
+													"error",
+													"Item missing dataset or bucket information",
+												);
+												return;
+											}
 
-										// Call the assign endpoint
-										await assignItem(item.datasetName, item.bucket, item.id);
+											// Call the assign endpoint
+											await assignItem(item.datasetName, item.bucket, item.id);
 
-										// Refresh the list to get the updated item
-										await gt.refreshList();
+											// Refresh the list to get the updated item
+											await gt.refreshList();
 
-										// Set the item as selected and switch to curate mode
-										await gt.selectItem(item.id);
-										setViewMode("curate");
-
-										toast("success", `Assigned ${item.id} for curation`);
-									} catch (error) {
-										const message =
-											error instanceof Error
-												? error.message
-												: "Failed to assign item";
-										toast("error", message);
-									}
-								}}
-								onInspect={(item) => {
-									// Show the item in the inspect modal
-									setInspectItem(item);
-								}}
-								onDelete={async (item) => {
-									try {
-										const isDeleted = item.deleted || item.status === "deleted";
-
-										// Validate required metadata
-										if (!item.datasetName) {
-											throw new Error(
-												`Item ${item.id} is missing datasetName metadata`,
-											);
-										}
-										if (!item.bucket) {
-											throw new Error(
-												`Item ${item.id} is missing bucket metadata`,
+											const selectionResult =
+												await resolveExplorerAssignSelection(
+													item.id,
+													(selectedItemId) => gt.selectItem(selectedItemId),
+												);
+											if (selectionResult.switchToCurate) {
+												setViewMode("curate");
+											}
+											toast(
+												selectionResult.toastKind,
+												selectionResult.toastMessage,
 											);
+										} catch (error) {
+											const message =
+												error instanceof Error
+													? error.message
+													: "Failed to assign item";
+											toast("error", message);
 										}
+									}}
+									onInspect={(item) => {
+										// Show the item in the inspect modal
+										setInspectItem(item);
+									}}
+									onDelete={async (item) => {
+										try {
+											const isDeleted =
+												item.deleted || item.status === "deleted";
 
-										if (isDeleted) {
-											// Restore: call the backend API directly
-											const itemWithEtag = item as typeof item & {
-												_etag?: string;
-											};
-											await restoreGroundTruth(
-												item.datasetName,
-												item.bucket,
-												item.id,
-												itemWithEtag._etag,
-											);
-											toast("success", `Restored ${item.id} to draft status.`);
-										} else {
-											// Delete: use DELETE endpoint directly
-											await deleteGroundTruth(
-												item.datasetName,
-												item.bucket,
-												item.id,
-											);
-											toast("info", `Marked ${item.id} as deleted.`);
+											// Validate required metadata
+											if (!item.datasetName) {
+												throw new Error(
+													`Item ${item.id} is missing datasetName metadata`,
+												);
+											}
+											if (!item.bucket) {
+												throw new Error(
+													`Item ${item.id} is missing bucket metadata`,
+												);
+											}
+
+											if (isDeleted) {
+												// Restore: call the backend API directly
+												const itemWithEtag = item as typeof item & {
+													_etag?: string;
+												};
+												await restoreGroundTruth(
+													item.datasetName,
+													item.bucket,
+													item.id,
+													itemWithEtag._etag,
+												);
+												toast(
+													"success",
+													`Restored ${item.id} to draft status.`,
+												);
+											} else {
+												// Delete: use DELETE endpoint directly
+												await deleteGroundTruth(
+													item.datasetName,
+													item.bucket,
+													item.id,
+												);
+												toast("info", `Marked ${item.id} as deleted.`);
+											}
+											invalidateInspectCacheForExplorerItem(item);
+											await gt.refreshList();
+										} catch (error) {
+											const message =
+												error instanceof Error
+													? error.message
+													: "Failed to update item";
+											toast("error", message);
 										}
-										await gt.refreshList();
-									} catch (error) {
-										const message =
-											error instanceof Error
-												? error.message
-												: "Failed to update item";
-										toast("error", message);
-									}
-								}}
-							/>
+									}}
+								/>
+							</Suspense>
 						</div>
 					</section>
 				)}
 
 				{viewMode === "curate" && (
-					<div className="grid grid-cols-1 md:grid-cols-12 gap-4 flex-1 min-h-0">
+					<div className="flex flex-1 gap-4 min-h-0">
 						{/* Left: Queue */}
 						{sidebarOpen && (
 							<QueueSidebar
-								className="hidden md:block col-span-1 md:col-span-4 lg:col-span-3"
+								className="hidden md:block flex-none w-64 lg:w-72"
 								items={gt.items}
 								selectedId={gt.selectedId}
 								onSelect={(id) => {
@@ -324,130 +519,56 @@ export default function GTAppDemo() {
 							/>
 						)}
 
-						{/* Center: Editor */}
-						<CuratePane
-							className={cn(
-								"col-span-1", // Mobile: full width
-								// In multi-turn mode (no references sidebar), take full remaining width
-								editorMode === "multi"
-									? sidebarOpen
-										? "md:col-span-8 lg:col-span-9"
-										: "md:col-span-12"
-									: sidebarOpen
-										? "md:col-span-8 lg:col-span-5"
-										: "md:col-span-12 lg:col-span-7",
-							)}
-							current={gt.current}
-							canApprove={gt.canApprove}
-							saving={gt.saving}
-							onUpdateQuestion={(v) => gt.updateQuestion(v)}
-							onUpdateAnswer={(v) => gt.updateAnswer(v)}
-							onUpdateComment={(v) => gt.updateComment(v)}
-							onUpdateTags={(tags) => gt.updateTags(tags)}
-							onUpdateHistory={(history) => gt.updateHistory(history)}
-							onDeleteTurn={(messageIndex) => gt.deleteTurn(messageIndex)}
-							onGenerateAgentTurn={onGenerateAgentTurn}
-							onEditorModeChange={setEditorMode}
-							onSaveDraft={() => onSave("draft")}
-							onApprove={() => onSave("approved")}
-							onUpdateReference={(refId, partial) =>
-								gt.updateReference(refId, partial)
-							}
-							onRemoveReference={(refId) => {
-								// In multi-turn mode, the modal shows its own toasts
-								gt.removeReferenceWithUndo(refId, (undo, timeoutMs) => {
-									if (editorMode === "single") {
-										toast("info", "Reference removed.", {
-											duration: timeoutMs,
-											actionLabel: "Undo",
-											onAction: undo,
-										});
-									}
-								});
-							}}
-							onOpenReference={onOpenRef}
-							onAddReferences={(refs) => {
-								gt.addReferences(refs);
-								// Toast is shown in the modal for multi-turn context
-							}}
-							onDuplicate={async () => {
-								const res = await gt.duplicateCurrent();
-								if (res.ok) {
-									toast(
-										"success",
-										`Created rephrase ${res.created.id} and opened it.`,
-									);
-								} else {
-									toast("error", res.error || "Duplicate failed");
-								}
-							}}
-							onSkip={async () => {
-								if (!gt.current) return;
-								const r = await gt.save("skipped");
-								if (!r.ok) return;
-								const idx = gt.items.findIndex((i) => i.id === r.saved.id);
-								const next =
-									idx >= 0 && idx < gt.items.length - 1
-										? gt.items[idx + 1]
-										: gt.items[0];
-								if (next) void gt.selectItem(next.id, { force: true });
-							}}
-							onDelete={() => toggleDeletedFlag(true)}
-							onRestore={() => toggleDeletedFlag(false)}
-						/>
-
-						{/* Right: References (Tabbed) - Only show in single-turn mode */}
-						{editorMode === "single" && (
-							<div
-								className={cn(
-									"hidden lg:block col-span-1",
-									sidebarOpen ? "lg:col-span-4" : "lg:col-span-5",
-								)}
-							>
-								<ReferencesSection
-									query={gt.query}
-									setQuery={gt.setQuery}
-									searching={gt.searching}
-									searchResults={gt.searchResults}
-									onRunSearch={gt.runSearch}
-									onAddRefs={(refs) => {
-										gt.addReferences(refs);
-										toast("success", `Added ${refs.length} reference(s)`);
-									}}
-									references={gt.current?.references || []}
-									onUpdateReference={(id, partial) =>
-										gt.updateReference(id, partial)
-									}
-									onRemoveReference={(refId) =>
-										gt.removeReferenceWithUndo(refId, (undo, timeoutMs) => {
-											toast("info", "Reference removed.", {
-												duration: timeoutMs,
-												actionLabel: "Undo",
-												onAction: undo,
-											});
-										})
-									}
-									onOpenReference={onOpenRef}
-									isMultiTurn={
-										!!(gt.current?.history && gt.current.history.length > 0)
+						{/* Center + Right: split-pane on desktop, editor + drawer on mobile */}
+						<div className="flex-1 min-w-0 min-h-0">
+							{isDesktop ? (
+								<SplitPaneLayout
+									className="h-full w-full"
+									left={curatePane}
+									right={
+										<div className="h-full overflow-y-auto">
+											{referencesPane}
+										</div>
 									}
 								/>
-							</div>
-						)}
+							) : (
+								<>
+									<div className="mb-2 flex justify-end">
+										<button
+											type="button"
+											onClick={() => setDrawerOpen(true)}
+											className="inline-flex items-center gap-1.5 rounded-xl border bg-white px-3 py-1.5 text-xs text-slate-600 hover:bg-violet-50 hover:text-violet-700 shadow-sm"
+										>
+											📋 Evidence
+										</button>
+									</div>
+									<div className="min-h-0">{curatePane}</div>
+									<EvidenceDrawer open={drawerOpen} onClose={closeDrawer}>
+										{referencesPane}
+									</EvidenceDrawer>
+								</>
+							)}
+						</div>
 					</div>
 				)}
 			</main>
 
 			{/* Inspect Item Modal */}
-			<InspectItemModal
-				isOpen={!!inspectItem}
-				item={inspectItem}
-				onClose={() => setInspectItem(null)}
-			/>
+			{inspectItem && (
+				<Suspense fallback={<ModalFallback label="Loading item inspector…" />}>
+					<InspectItemModal
+						isOpen={true}
+						item={inspectItem}
+						onClose={() => setInspectItem(null)}
+					/>
+				</Suspense>
+			)}
 
 			{/* Tag Glossary Modal */}
 			{glossaryOpen && (
-				<TagGlossaryModal onClose={() => setGlossaryOpen(false)} />
+				<Suspense fallback={<ModalFallback label="Loading tag glossary…" />}>
+					<TagGlossaryModal onClose={() => setGlossaryOpen(false)} />
+				</Suspense>
 			)}
 
 			{/* Toasts */}
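Reviewer note: the next-item choice inside `onSkip` is easier to check when pulled out as a pure function. The sketch below assumes only that items carry a string `id`; `pickNextAfter` is a hypothetical name and is not part of this change.

```typescript
// Hypothetical helper mirroring the selection rule in onSkip:
// take the item after the current one, otherwise wrap to the first item.
function pickNextAfter<T extends { id: string }>(
	items: readonly T[],
	currentId: string,
): T | undefined {
	const index = items.findIndex((item) => item.id === currentId);
	// A missing id (index === -1) and the last position both fall back to items[0].
	return index >= 0 && index < items.length - 1 ? items[index + 1] : items[0];
}
```

One consequence worth noting: with a single-item list the function returns that same item, so skipping the only item in the queue reselects it.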
diff --git a/frontend/src/dev/self-tests.ts b/frontend/src/dev/self-tests.ts
index 7b5b4a6..7659582 100644
--- a/frontend/src/dev/self-tests.ts
+++ b/frontend/src/dev/self-tests.ts
@@ -12,14 +12,37 @@ export function runSelfTests() {
 		keyParagraph: "",
 		...over,
 	});
-	const mk = (refs: Reference[]): GroundTruthItem => ({
-		id: "T",
-		question: "q",
-		answer: "a",
-		references: refs,
-		status: "draft",
-		providerId: "json",
-	});
+	const mk = (refs: Reference[]): GroundTruthItem => {
+		const item: GroundTruthItem = {
+			id: "T",
+			question: "q",
+			answer: "a",
+			status: "draft",
+			providerId: "json",
+		};
+		if (refs.length > 0) {
+			item.plugins = {
+				"rag-compat": {
+					kind: "rag-compat",
+					version: "1.0",
+					data: {
+						retrievals: {
+							_unassociated: {
+								candidates: refs.map((r) => ({
+									url: r.url,
+									title: r.title,
+									chunk: r.snippet,
+									visitedAt: r.visitedAt,
+									keyParagraph: r.keyParagraph,
+								})),
+							},
+						},
+					},
+				},
+			};
+		}
+		return item;
+	};
 
 	console.assert(
 		refsApprovalReady(mk([])),
diff --git a/frontend/src/hooks/useCurationInstructions.ts b/frontend/src/hooks/useCurationInstructions.ts
index 9e09e67..460f92e 100644
--- a/frontend/src/hooks/useCurationInstructions.ts
+++ b/frontend/src/hooks/useCurationInstructions.ts
@@ -1,6 +1,8 @@
-import { useCallback, useEffect, useMemo, useRef, useState } from "react";
+import { useCallback, useEffect, useRef, useState } from "react";
 import { getDatasetCurationInstructions } from "../services/datasets";
 
+const instructionsCache = new Map<string, string>();
+
 /**
  * Hook to retrieve dataset-level curation instructions (markdown) with a per-session cache.
  * - Returns cached value immediately if present.
@@ -9,44 +11,89 @@ import { getDatasetCurationInstructions } from "../services/datasets";
  */
 function useCurationInstructions(datasetName?: string | null) {
 	const ds = (datasetName || "").trim();
-	const cacheRef = useRef<Record<string, string>>({});
+	const [markdown, setMarkdown] = useState<string | undefined>(() =>
+		ds ? instructionsCache.get(ds) : undefined,
+	);
 	const [loading, setLoading] = useState(false);
 	const [error, setError] = useState<string | null>(null);
-	const [version, setVersion] = useState(0); // bump to force recompute when cache changes
+	const activeControllerRef = useRef<AbortController | null>(null);
+	const requestIdRef = useRef(0);
+
+	const loadInstructions = useCallback(
+		async (forceRefresh = false) => {
+			if (!ds) return;
+
+			const cached = instructionsCache.get(ds);
+			if (!forceRefresh && typeof cached !== "undefined") {
+				setMarkdown(cached);
+				setLoading(false);
+				setError(null);
+				return;
+			}
 
-	// biome-ignore lint/correctness/useExhaustiveDependencies(version): suppress dependency version
-	const value = useMemo(() => {
-		if (!ds) return undefined;
-		// Accessing ref doesn't need to be in deps; we trigger recompute by bumping version in state
-		return cacheRef.current[ds];
-	}, [ds, version]);
+			activeControllerRef.current?.abort();
+			const controller = new AbortController();
+			activeControllerRef.current = controller;
+			const requestId = ++requestIdRef.current;
+			setLoading(true);
+			setError(null);
+
+			try {
+				const doc = await getDatasetCurationInstructions(ds, controller.signal);
+				if (requestId !== requestIdRef.current || controller.signal.aborted) {
+					return;
+				}
+				const nextMarkdown = doc?.instructions || "";
+				instructionsCache.set(ds, nextMarkdown);
+				setMarkdown(nextMarkdown);
+			} catch (e) {
+				if (controller.signal.aborted) return;
+				const msg = e instanceof Error ? e.message : String(e);
+				setError(msg);
+			} finally {
+				if (requestId === requestIdRef.current && !controller.signal.aborted) {
+					setLoading(false);
+				}
+			}
+		},
+		[ds],
+	);
 
 	const refresh = useCallback(async () => {
-		if (!ds) return;
-		setLoading(true);
-		setError(null);
-		try {
-			const doc = await getDatasetCurationInstructions(ds);
-			const md = doc?.instructions || "";
-			cacheRef.current[ds] = md;
-			setVersion((v) => v + 1);
-		} catch (e) {
-			const msg = e instanceof Error ? e.message : String(e);
-			setError(msg);
-		} finally {
+		await loadInstructions(true);
+	}, [loadInstructions]);
+
+	useEffect(() => {
+		if (!ds) {
+			activeControllerRef.current?.abort();
+			setMarkdown(undefined);
 			setLoading(false);
+			setError(null);
+			return;
 		}
-	}, [ds]);
 
-	// Auto fetch when dataset changes and not in cache yet
-	useEffect(() => {
-		if (!ds) return;
-		if (typeof cacheRef.current[ds] === "undefined") {
-			void refresh();
+		const cached = instructionsCache.get(ds);
+		setMarkdown(cached);
+		if (typeof cached !== "undefined") {
+			setLoading(false);
+			setError(null);
+			return;
 		}
-	}, [ds, refresh]);
 
-	return { markdown: value, loading, error, refresh } as const;
+		void loadInstructions();
+
+		return () => {
+			activeControllerRef.current?.abort();
+		};
+	}, [ds, loadInstructions]);
+
+	useEffect(() => {
+		return () => {
+			activeControllerRef.current?.abort();
+		};
+	}, []);
+
+	return { markdown, loading, error, refresh } as const;
 }
 
 export default useCurationInstructions;
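Reviewer note: `useCurationInstructions` now drops stale responses by comparing a request id against `requestIdRef.current`. The sketch below isolates that latest-request-wins guard. `LatestRequestGuard` is a hypothetical class for illustration, and it deliberately leaves out the `AbortController` that the hook also uses to cancel the in-flight fetch.

```typescript
// Hypothetical illustration of the request-id guard; not code from this change.
class LatestRequestGuard {
	private currentId = 0;

	// Start a new request. The returned check reports whether this request
	// is still the most recent one started on the guard.
	begin(): () => boolean {
		const id = ++this.currentId;
		return () => id === this.currentId;
	}
}

// Usage sketch: call begin() before awaiting, then test the check before
// writing state, so a slow older response cannot overwrite a newer one.
//   const isCurrent = guard.begin();
//   const doc = await fetchSomething();
//   if (!isCurrent()) return;
```

The hook pairs this check with `controller.signal.aborted`, so a superseded request is both cancelled and ignored if it still resolves.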
diff --git a/frontend/src/hooks/useGroundTruth.ts b/frontend/src/hooks/useGroundTruth.ts
index ab24eca..f721693 100644
--- a/frontend/src/hooks/useGroundTruth.ts
+++ b/frontend/src/hooks/useGroundTruth.ts
@@ -2,20 +2,27 @@ import { useCallback, useEffect, useMemo, useRef, useState } from "react";
 import { ApiProvider } from "../adapters/apiProvider";
 import { isDemoModeIgnored, shouldUseDemoProvider } from "../config/demo";
 import type {
+	ContextEntry,
 	ConversationTurn,
+	ExpectedTools,
 	GroundTruthItem,
 	Reference,
 } from "../models/groundTruth";
+import {
+	createConversationTurn,
+	ensureConversationTurnIdentity,
+	getItemReferences,
+	getLastAgentTurn,
+	getLastUserTurn,
+	getReferenceIdentityKey,
+	withDerivedLegacyFields,
+	withUpdatedReferences,
+} from "../models/groundTruth";
 import { canApproveCandidate } from "../models/gtHelpers";
 import type { Provider } from "../models/provider";
-import { randId } from "../models/utils";
-import {
-	type ChatReference,
-	callAgentChat,
-	formatConversationForAgent,
-	formatExpectedBehaviorForChat,
-} from "../services/chatService";
-import { mapApiErrorToMessage } from "../services/http";
+import { getReferenceApprovalRequirements } from "../models/validators";
+
+import { useRuntimeConfig } from "../services/runtimeConfig";
 import { addTags } from "../services/tags";
 import { logEvent } from "../services/telemetry";
 import { invalidateGroundTruthCache } from "./useGroundTruthCache";
@@ -28,11 +35,6 @@ type SaveResult =
 
 type ExportResult = { ok: true; json: string } | { ok: false; error: string };
 
-export type AgentGenerationResult =
-	| { ok: false; error: string }
-	| { ok: true; messageIndex: number }
-	| { ok: true };
-
 type UseGroundTruth = {
 	// Provider (exposed for ExportModal convenience)
 	provider: Provider | null;
@@ -80,11 +82,13 @@ type UseGroundTruth = {
 
 	// Multi-turn support
 	updateHistory: (history: ConversationTurn[]) => void;
-	addTurn: (role: "user" | "agent", content: string) => void;
+	addTurn: (role: string, content: string) => void;
 	deleteTurn: (messageIndex: number) => void;
-	regenerateAgentTurn: (messageIndex: number) => Promise<AgentGenerationResult>;
-	generateAgentTurn: (messageIndex: number) => Promise<AgentGenerationResult>;
-	runAgentTurn: (messageIndex: number) => Promise<AgentGenerationResult>;
+	// Context entries
+	updateContextEntries: (entries: ContextEntry[]) => void;
+
+	// Expected tools
+	updateExpectedTools: (tools: ExpectedTools) => void;
 
 	// Save + status
 	saving: boolean;
@@ -108,9 +112,17 @@ type UseGroundTruth = {
 	hasUnsaved: boolean;
 };
 
+function invalidateInspectCacheForItem(
+	item: Pick<GroundTruthItem, "datasetName" | "bucket" | "id">,
+): void {
+	if (item.datasetName && item.bucket && item.id) {
+		invalidateGroundTruthCache(item.datasetName, item.bucket, item.id);
+	}
+}
+
 // Pure helper to compute a stable fingerprint for unsaved detection
 function stateSignature(it: GroundTruthItem): string {
-	const refs = [...(it.references || [])]
+	const refs = [...getItemReferences(it)]
 		.map((r) => ({
 			id: r.id,
 			title: r.title || "",
@@ -120,15 +132,19 @@ function stateSignature(it: GroundTruthItem): string {
 			keyParagraph: (r.keyParagraph || "").trim(),
 			bonus: !!r.bonus,
 			messageIndex: r.messageIndex,
+			turnId: r.turnId,
+			toolCallId: r.toolCallId,
 		}))
 		.sort((a, b) => a.id.localeCompare(b.id));
 	return JSON.stringify({
 		id: it.id,
 		providerId: it.providerId,
-		question: (it.question || "").trim(),
-		answer: (it.answer || "").trim(),
+		question: getLastUserTurn(it).trim(),
+		answer: getLastAgentTurn(it).trim(),
 		comment: (it.comment ?? "").trim(),
-		history: it.history || [],
+		history: ensureConversationTurnIdentity(it.history),
+		contextEntries: it.contextEntries ?? [],
+		expectedTools: it.expectedTools ?? null,
 		references: refs,
 		manualTags: [...(it.manualTags || [])]
 			.map((t) => t.trim())
@@ -139,19 +155,57 @@ function stateSignature(it: GroundTruthItem): string {
 	});
 }
 
-function chatReferencesToGroundTruth(
-	chatRefs: ChatReference[],
-	messageIndex: number,
+function ensureEditableHistory(item: GroundTruthItem): ConversationTurn[] {
+	return ensureConversationTurnIdentity(item.history);
+}
+
+function withCanonicalHistory(
+	item: GroundTruthItem,
+	history: ConversationTurn[],
+): GroundTruthItem {
+	return withDerivedLegacyFields({
+		...item,
+		history: ensureConversationTurnIdentity(history),
+	});
+}
+
+function pruneReferencesForHistory(
+	refs: Reference[],
+	history: ConversationTurn[],
 ): Reference[] {
-	if (!chatRefs?.length) return [];
-	return chatRefs.map((ref) => ({
-		id: ref.id?.trim() || randId("ref"),
-		title: ref.title?.trim() || undefined,
-		url: ref.url?.trim() || "",
-		snippet: ref.snippet?.trim() || undefined,
-		keyParagraph: ref.keyParagraph?.trim() || undefined,
-		messageIndex,
-	}));
+	const nextTurnIds = new Set(
+		history.map((turn) => turn.turnId).filter(Boolean),
+	);
+	return refs
+		.map((ref) => {
+			if (ref.turnId) {
+				if (!nextTurnIds.has(ref.turnId)) {
+					return null;
+				}
+				return {
+					...ref,
+					messageIndex: undefined,
+				};
+			}
+			if (typeof ref.messageIndex !== "number") {
+				return ref;
+			}
+			if (ref.messageIndex < history.length) {
+				const turnId = history[ref.messageIndex]?.turnId;
+				return {
+					...ref,
+					messageIndex: turnId ? undefined : ref.messageIndex,
+					turnId,
+				};
+			}
+			return null;
+		})
+		.filter((ref): ref is Reference => ref !== null);
+}
+
+function withCanonicalItem(item: GroundTruthItem): GroundTruthItem {
+	const history = ensureEditableHistory(item);
+	return withCanonicalHistory(item, history);
 }
 
 function useGroundTruth(): UseGroundTruth {
@@ -166,11 +220,14 @@ function useGroundTruth(): UseGroundTruth {
 
 	// Search (delegated to sub-hook)
 	const { query, setQuery, searching, searchResults, runSearch, clearResults } =
-		useReferencesSearch({ getSeedQuery: () => current?.question });
+		useReferencesSearch({
+			getSeedQuery: () => (current ? getLastUserTurn(current) : undefined),
+		});
 
 	// Save idempotency
 	const [lastSavedStateFp, setLastSavedStateFp] = useState<string>("");
 	const [saving, setSaving] = useState(false);
+	const runtimeConfig = useRuntimeConfig();
 
 	// References editor (delegated to sub-hook)
 	const {
@@ -213,10 +270,15 @@ function useGroundTruth(): UseGroundTruth {
 				setSelectedId(first?.id ?? null);
 				if (first) {
 					// Prime the editor immediately so fields populate without waiting for a follow-up get()
-					const clone = JSON.parse(JSON.stringify(first)) as GroundTruthItem;
+					const clone = withCanonicalItem(
+						JSON.parse(JSON.stringify(first)) as GroundTruthItem,
+					);
 					setCurrent(clone);
-					qaBaseline.current = { q: first.question, a: first.answer };
-					setLastSavedStateFp(stateSignature(first));
+					qaBaseline.current = {
+						q: getLastUserTurn(first),
+						a: getLastAgentTurn(first),
+					};
+					setLastSavedStateFp(stateSignature(withCanonicalItem(first)));
 				}
 			} catch {
 				// Load errors are handled elsewhere via explicit actions.
@@ -244,19 +306,21 @@ function useGroundTruth(): UseGroundTruth {
 		(async () => {
 			const it = await p.get(selectedId);
 			if (!it) return;
-			const clone = JSON.parse(JSON.stringify(it)) as GroundTruthItem;
+			const clone = withCanonicalItem(
+				JSON.parse(JSON.stringify(it)) as GroundTruthItem,
+			);
 			setCurrent(clone);
-			qaBaseline.current = { q: it.question, a: it.answer };
+			qaBaseline.current = { q: getLastUserTurn(it), a: getLastAgentTurn(it) };
 			clearResults();
-			setLastSavedStateFp(stateSignature(it));
+			setLastSavedStateFp(stateSignature(withCanonicalItem(it)));
 		})();
 	}, [selectedId, clearResults, current?.id]);
 
 	const qaChanged = useMemo(() => {
 		if (!current) return false;
 		return (
-			current.question !== qaBaseline.current.q ||
-			current.answer !== qaBaseline.current.a
+			getLastUserTurn(current) !== qaBaseline.current.q ||
+			getLastAgentTurn(current) !== qaBaseline.current.a
 		);
 	}, [current]);
 
@@ -270,26 +334,65 @@ function useGroundTruth(): UseGroundTruth {
 	}, []);
 
 	const updateQuestion = useCallback((q: string) => {
-		setCurrent((prev) => (prev ? { ...prev, question: q } : prev));
+		setCurrent((prev) => {
+			if (!prev) return prev;
+			const history = ensureEditableHistory(prev);
+			let updated = false;
+			const nextHistory = [...history];
+			for (let i = nextHistory.length - 1; i >= 0; i--) {
+				if (nextHistory[i].role === "user") {
+					nextHistory[i] = { ...nextHistory[i], content: q };
+					updated = true;
+					break;
+				}
+			}
+			if (!updated) {
+				nextHistory.push(createConversationTurn({ role: "user", content: q }));
+			}
+			return withCanonicalHistory(prev, nextHistory);
+		});
 	}, []);
 	const updateAnswer = useCallback((a: string) => {
-		setCurrent((prev) => (prev ? { ...prev, answer: a } : prev));
+		setCurrent((prev) => {
+			if (!prev) return prev;
+			const history = ensureEditableHistory(prev);
+			let updated = false;
+			const nextHistory = [...history];
+			for (let i = nextHistory.length - 1; i >= 0; i--) {
+				if (nextHistory[i].role !== "user") {
+					nextHistory[i] = { ...nextHistory[i], content: a };
+					updated = true;
+					break;
+				}
+			}
+			if (!updated) {
+				nextHistory.push(createConversationTurn({ role: "agent", content: a }));
+			}
+			return withCanonicalHistory(prev, nextHistory);
+		});
 	}, []);
 	const updateComment = useCallback((v: string) => {
 		setCurrent((prev) => (prev ? { ...prev, comment: v } : prev));
 	}, []);
 
-	const canApprove = useMemo(() => canApproveCandidate(current), [current]);
+	const canApprove = useMemo(
+		() =>
+			canApproveCandidate(
+				current,
+				getReferenceApprovalRequirements(runtimeConfig),
+			),
+		[current, runtimeConfig],
+	);
 
 	const save = useCallback(
 		async (nextStatus?: GroundTruthItem["status"]): Promise<SaveResult> => {
 			const p = providerRef.current;
 			if (!current || !p || saving) return { ok: false, error: "Not ready" };
 
-			const candidate: GroundTruthItem = {
+			const candidate: GroundTruthItem = withCanonicalItem({
 				...current,
 				status: nextStatus || current.status,
-			};
+			});
 			const stateFp = stateSignature(candidate);
 			if (stateFp === lastSavedStateFp) {
 				return { ok: true, saved: current, message: "No changes" };
@@ -297,7 +400,10 @@ function useGroundTruth(): UseGroundTruth {
 
 			if (
 				["approved"].includes(candidate.status) &&
-				!canApproveCandidate(candidate)
+				!canApproveCandidate(
+					candidate,
+					getReferenceApprovalRequirements(runtimeConfig),
+				)
 			) {
 				return { ok: false, error: "References not complete for approval" };
 			}
@@ -308,31 +414,51 @@ function useGroundTruth(): UseGroundTruth {
 			try {
 				const prevBeforeSave = current; // capture to merge transient fields
 				const saved = await p.save(candidate);
-				// SA-232: Backend does not persist visitedAt; reattach any prior visitedAt values by URL.
-				const prevRefs = prevBeforeSave?.references;
-				if (prevRefs?.length && saved.references?.length) {
-					const visitedByUrl = new Map(
+				// SA-232: Backend does not persist visitedAt; reattach any prior visitedAt values by
+				// composite reference identity so duplicate URLs across turns/tool ownership stay distinct.
+				const prevRefs = getItemReferences(prevBeforeSave);
+				const savedRefs = getItemReferences(saved);
+				if (prevRefs.length && savedRefs.length) {
+					const visitedByReferenceIdentity = new Map(
 						prevRefs
 							.filter((r) => r.visitedAt)
-							.map((r) => [r.url, r.visitedAt as string]),
+							.map(
+								(r) =>
+									[getReferenceIdentityKey(r), r.visitedAt as string] as const,
+							),
 					);
-					for (const r of saved.references) {
+					let changed = false;
+					const merged = savedRefs.map((r) => {
 						if (!r.visitedAt) {
-							const v = visitedByUrl.get(r.url);
-							if (v) r.visitedAt = v;
+							const v = visitedByReferenceIdentity.get(
+								getReferenceIdentityKey(r),
+							);
+							if (v) {
+								changed = true;
+								return { ...r, visitedAt: v };
+							}
 						}
+						return r;
+					});
+					if (changed) {
+						// Merge the restored visitedAt values back into the saved item (mutates `saved` in place)
+						Object.assign(saved, withUpdatedReferences(saved, merged));
 					}
 				}
-				setItems((arr) => arr.map((i) => (i.id === saved.id ? saved : i)));
-				setCurrent(saved);
-				setLastSavedStateFp(stateSignature(saved));
+				const canonicalSaved = withCanonicalItem(saved);
+				setItems((arr) =>
+					arr.map((i) => (i.id === canonicalSaved.id ? canonicalSaved : i)),
+				);
+				setCurrent(canonicalSaved);
+				setLastSavedStateFp(stateSignature(canonicalSaved));
 				if (qaChanged)
-					qaBaseline.current = { q: saved.question, a: saved.answer };
+					qaBaseline.current = {
+						q: getLastUserTurn(canonicalSaved),
+						a: getLastAgentTurn(canonicalSaved),
+					};
 
 				// FR-002: Invalidate cache after successful save to ensure fresh data on next inspection
-				if (saved.datasetName && saved.bucket && saved.id) {
-					invalidateGroundTruthCache(saved.datasetName, saved.bucket, saved.id);
-				}
+				invalidateInspectCacheForItem(saved);
 
 				// Persist any new manual tags only after a successful save; fire-and-forget
 				try {
@@ -347,13 +473,13 @@ function useGroundTruth(): UseGroundTruth {
 				} catch {}
 				try {
 					const baseProps = {
-						providerId: saved.providerId,
-						itemId: saved.id,
-						status: saved.status,
-						selectedRefCount: saved.references?.length,
+						providerId: canonicalSaved.providerId,
+						itemId: canonicalSaved.id,
+						status: canonicalSaved.status,
+						selectedRefCount: getItemReferences(canonicalSaved).length,
 						durationMs: Date.now() - started,
 					};
-					if (nextStatus === "approved" || saved.status === "approved")
+					if (nextStatus === "approved" || canonicalSaved.status === "approved")
 						logEvent("gtc.approve", baseProps);
 					else logEvent("gtc.save_draft", baseProps);
 				} catch {}
@@ -365,7 +491,7 @@ function useGroundTruth(): UseGroundTruth {
 				setSaving(false);
 			}
 		},
-		[current, saving, lastSavedStateFp, qaChanged],
+		[current, saving, lastSavedStateFp, qaChanged, runtimeConfig],
 	);
 
 	// Determine if current item differs from last saved state
@@ -414,11 +540,16 @@ function useGroundTruth(): UseGroundTruth {
 					return false;
 				}
 
-				const clone = JSON.parse(JSON.stringify(it)) as GroundTruthItem;
+				const clone = withCanonicalItem(
+					JSON.parse(JSON.stringify(it)) as GroundTruthItem,
+				);
 				setCurrent(clone);
-				qaBaseline.current = { q: it.question, a: it.answer };
+				qaBaseline.current = {
+					q: getLastUserTurn(it),
+					a: getLastAgentTurn(it),
+				};
 				clearResults();
-				setLastSavedStateFp(stateSignature(it));
+				setLastSavedStateFp(stateSignature(withCanonicalItem(it)));
 				setSelectedId(id);
 				return true;
 			} catch (error) {
@@ -457,10 +588,13 @@ function useGroundTruth(): UseGroundTruth {
 			const p = providerRef.current;
 			if (!current || !p) return { ok: false, error: "No current" };
 			try {
-				const saved = await p.save({ ...current, deleted: nextDeleted });
+				const saved = withCanonicalItem(
+					await p.save({ ...current, deleted: nextDeleted }),
+				);
 				setItems((arr) => arr.map((i) => (i.id === saved.id ? saved : i)));
 				setCurrent(saved);
 				setLastSavedStateFp(stateSignature(saved));
+				invalidateInspectCacheForItem(saved);
 				try {
 					logEvent(nextDeleted ? "gtc.soft_delete" : "gtc.restore", {
 						providerId: saved.providerId,
@@ -484,12 +618,15 @@ function useGroundTruth(): UseGroundTruth {
 			const it = items.find((i) => i.id === itemId);
 			if (!it) return { ok: false, error: "Item not found" };
 			try {
-				const saved = await p.save({ ...it, deleted: nextDeleted });
+				const saved = withCanonicalItem(
+					await p.save({ ...it, deleted: nextDeleted }),
+				);
 				setItems((arr) => arr.map((i) => (i.id === saved.id ? saved : i)));
 				if (current && current.id === saved.id) {
 					setCurrent(saved);
 					setLastSavedStateFp(stateSignature(saved));
 				}
+				invalidateInspectCacheForItem(saved);
 				try {
 					logEvent(nextDeleted ? "gtc.soft_delete" : "gtc.restore", {
 						providerId: saved.providerId,
@@ -513,319 +650,69 @@ function useGroundTruth(): UseGroundTruth {
 	const updateHistory = useCallback((history: ConversationTurn[]) => {
 		setCurrent((prev) => {
 			if (!prev) return prev;
-
-			// Find last user and agent turns in single reverse iteration
-			let lastUser: ConversationTurn | undefined;
-			let lastAgent: ConversationTurn | undefined;
-
-			for (let i = history.length - 1; i >= 0; i--) {
-				const turn = history[i];
-				if (turn.role === "user" && !lastUser) {
-					lastUser = turn;
-				} else if (turn.role === "agent" && !lastAgent) {
-					lastAgent = turn;
-				}
-
-				// Early exit if both found
-				if (lastUser && lastAgent) break;
-			}
-
-			return {
-				...prev,
-				history,
-				question: lastUser?.content || prev.question,
-				answer: lastAgent?.content || prev.answer,
-			};
+			return withCanonicalHistory(prev, history);
 		});
 	}, []);
 
-	const addTurn = useCallback((role: "user" | "agent", content: string) => {
+	const addTurn = useCallback((role: string, content: string) => {
 		setCurrent((prev) => {
 			if (!prev) return prev;
 			const newHistory: ConversationTurn[] = [
-				...(prev.history || []),
-				{ role, content },
+				...ensureEditableHistory(prev),
+				createConversationTurn({ role, content }),
 			];
-			// Sync to question/answer
-			const lastUser = [...newHistory].reverse().find((t) => t.role === "user");
-			const lastAgent = [...newHistory]
-				.reverse()
-				.find((t) => t.role === "agent");
-			return {
-				...prev,
-				history: newHistory,
-				question: lastUser?.content || prev.question,
-				answer: lastAgent?.content || prev.answer,
-			};
+			return withCanonicalHistory(prev, newHistory);
 		});
 	}, []);
 
 	const deleteTurn = useCallback((messageIndex: number) => {
 		setCurrent((prev) => {
 			if (!prev) return prev;
-			const history = prev.history || [];
+			const history = ensureEditableHistory(prev);
 			if (messageIndex < 0 || messageIndex >= history.length) return prev;
 
+			const deletedTurnId = history[messageIndex]?.turnId;
 			// Remove the turn at the specified index
 			const newHistory = history.filter((_, i) => i !== messageIndex);
 
-			// Re-index references: shift down any references with messageIndex > deleted index
-			const updatedReferences = (prev.references || [])
-				.map((ref) => {
-					if (typeof ref.messageIndex !== "number") return ref;
-
-					// Remove references for the deleted turn
-					if (ref.messageIndex === messageIndex) {
-						return null;
-					}
-
-					// Shift down references after the deleted turn
-					if (ref.messageIndex > messageIndex) {
-						return { ...ref, messageIndex: ref.messageIndex - 1 };
-					}
+			const currentRefs = getItemReferences(prev).filter((ref) =>
+				ref.turnId
+					? ref.turnId !== deletedTurnId
+					: ref.messageIndex !== messageIndex,
+			);
+			const updatedReferences = pruneReferencesForHistory(
+				currentRefs,
+				newHistory,
+			);
 
-					return ref;
-				})
-				.filter((ref): ref is Reference => ref !== null);
-
-			// Sync last user/agent turns to question/answer for backward compatibility
-			const lastUser = [...newHistory].reverse().find((t) => t.role === "user");
-			const lastAgent = [...newHistory]
-				.reverse()
-				.find((t) => t.role === "agent");
-
-			return {
-				...prev,
-				history: newHistory,
-				references: updatedReferences,
-				question: lastUser?.content || "",
-				answer: lastAgent?.content || "",
-			};
+			return withUpdatedReferences(
+				withCanonicalHistory(prev, newHistory),
+				updatedReferences,
+			);
 		});
 	}, []);
 
-	const appendAgentTurn =
-		useCallback(async (): Promise<AgentGenerationResult> => {
-			const item = current;
-			if (!item)
-				return { ok: false, error: "Select a ground truth item first." };
-			const history = item.history || [];
-			if (!history.length)
-				return {
-					ok: false,
-					error: "Add a user turn before requesting an agent response.",
-				};
-			const lastTurn = history[history.length - 1];
-			if (lastTurn.role !== "user")
-				return {
-					ok: false,
-					error: "Add a user message before requesting an agent response.",
-				};
-			const transcript = formatConversationForAgent(history);
-			if (!transcript)
-				return {
-					ok: false,
-					error: "Conversation history is empty.",
-				};
-			const targetId = item.id;
-			const started = Date.now();
-			try {
-				const { content, references } = await callAgentChat(transcript);
-				const trimmed = content.trim();
-				if (!trimmed)
-					return {
-						ok: false,
-						error: "Agent returned an empty response.",
-					};
-				const newMessageIndex = history.length;
-				setCurrent((prev) => {
-					if (!prev || prev.id !== targetId) return prev;
-					const prevHistory = prev.history
-						? [...prev.history]
-						: ([] as ConversationTurn[]);
-					const nextHistory: ConversationTurn[] = [
-						...prevHistory,
-						{ role: "agent", content: trimmed },
-					];
-					const lastUser = [...nextHistory]
-						.reverse()
-						.find((turn) => turn.role === "user");
-					const mappedRefs = chatReferencesToGroundTruth(
-						references,
-						newMessageIndex,
-					);
-					const filteredRefs = (prev.references || []).filter(
-						(ref) => ref.messageIndex !== newMessageIndex,
-					);
-					return {
-						...prev,
-						history: nextHistory,
-						references: [...filteredRefs, ...mappedRefs],
-						question: lastUser?.content || prev.question,
-						answer: trimmed,
-					};
-				});
-				try {
-					logEvent("gtc.agent_turn_add", {
-						referenceCount: references.length,
-						messageIndex: newMessageIndex,
-						durationMs: Date.now() - started,
-					});
-				} catch {}
-				return { ok: true as const, messageIndex: newMessageIndex };
-			} catch (err) {
-				const message = mapApiErrorToMessage(err);
-				try {
-					logEvent("gtc.agent_turn_error", {
-						stage: "add",
-						error: message,
-					});
-				} catch {}
-				return { ok: false as const, error: message };
-			}
-		}, [current]);
-
-	const regenerateAgentTurn = useCallback(
-		async (messageIndex: number): Promise<AgentGenerationResult> => {
-			const item = current;
-			if (!item)
-				return { ok: false, error: "Select a ground truth item first." };
-
-			const history = item.history || [];
-			if (messageIndex < 0 || messageIndex >= history.length)
-				return { ok: false, error: "Turn index is out of range." };
-
-			const targetTurn = history[messageIndex];
-			if (targetTurn.role !== "agent")
-				return { ok: false, error: "Only agent turns can be regenerated." };
-
-			// Format conversation history up to this turn
-			const transcript = formatConversationForAgent(history, messageIndex);
-			if (!transcript)
-				return { ok: false, error: "Conversation history is empty." };
-
-			// Append expected behavior if present
-			const expectedBehaviorStr = formatExpectedBehaviorForChat(
-				targetTurn.expectedBehavior,
-			);
-			const messageWithBehavior = expectedBehaviorStr
-				? `${transcript}\n\n${expectedBehaviorStr}`
-				: transcript;
-
-			const targetId = item.id;
-			const started = Date.now();
-
-			try {
-				const { content, references } =
-					await callAgentChat(messageWithBehavior);
-				const trimmed = content.trim();
-				if (!trimmed)
-					return {
-						ok: false,
-						error: "Agent returned an empty response.",
-					};
-
-				// Single state update with all changes (React 18 auto-batches)
-				setCurrent((prev) => {
-					if (!prev || prev.id !== targetId) return prev;
-
-					// Optimize: only copy/update what changed
-					const updatedHistory = prev.history
-						? [...prev.history] // Still need to copy for immutability
-						: [];
-
-					// Direct index update instead of map (O(1) vs O(n))
-					if (messageIndex < updatedHistory.length) {
-						updatedHistory[messageIndex] = {
-							...updatedHistory[messageIndex],
-							content: trimmed,
-						};
-					}
-
-					// Filter refs for this turn only
-					const refsToKeep =
-						prev.references?.filter(
-							(ref) => ref.messageIndex !== messageIndex,
-						) || [];
-
-					const mappedRefs = chatReferencesToGroundTruth(
-						references,
-						messageIndex,
-					);
-
-					// Find last agent turn efficiently (reverse search, early exit)
-					let lastAgentContent = prev.answer;
-					for (let i = updatedHistory.length - 1; i >= 0; i--) {
-						if (updatedHistory[i].role === "agent") {
-							lastAgentContent = updatedHistory[i].content;
-							break;
-						}
-					}
-
-					// Single state update with all changes
-					return {
-						...prev,
-						history: updatedHistory,
-						references: [...refsToKeep, ...mappedRefs],
-						answer: lastAgentContent,
-					};
-				});
-
-				// Fire-and-forget logging (non-blocking)
-				try {
-					logEvent("gtc.agent_turn_regenerate", {
-						referenceCount: references.length,
-						messageIndex,
-						hasExpectedBehavior: !!expectedBehaviorStr,
-						durationMs: Date.now() - started,
-					});
-				} catch {}
-
-				return { ok: true as const, messageIndex };
-			} catch (err) {
-				const message = mapApiErrorToMessage(err);
-				try {
-					logEvent("gtc.agent_turn_error", {
-						stage: "regenerate",
-						error: message,
-					});
-				} catch {}
-				return { ok: false as const, error: message };
-			}
-		},
-		[current],
-	);
-
-	const generateAgentTurn = useCallback(
-		async (messageIndex: number): Promise<AgentGenerationResult> => {
-			if (messageIndex < 0) return appendAgentTurn();
-			return regenerateAgentTurn(messageIndex);
-		},
-		[appendAgentTurn, regenerateAgentTurn],
-	);
+	const updateContextEntries = useCallback((entries: ContextEntry[]) => {
+		setCurrent((prev) => (prev ? { ...prev, contextEntries: entries } : prev));
+	}, []);
 
-	/**
-	 * Run full agent with tools (searches + retrieval) to regenerate an agent turn.
-	 * Updates both the answer and references for the turn.
-	 * This is the same as regenerateAgentTurn.
-	 */
-	const runAgentTurn = useCallback(
-		async (messageIndex: number): Promise<AgentGenerationResult> => {
-			return regenerateAgentTurn(messageIndex);
-		},
-		[regenerateAgentTurn],
-	);
+	const updateExpectedTools = useCallback((tools: ExpectedTools) => {
+		setCurrent((prev) => (prev ? { ...prev, expectedTools: tools } : prev));
+	}, []);
 
 	const duplicateCurrent = useCallback(async () => {
 		const p = providerRef.current;
 		if (!current || !p) return { ok: false as const, error: "No current" };
 		try {
-			const created = await p.duplicate(current);
+			const created = withCanonicalItem(await p.duplicate(current));
 			// Insert at top of list and select it
 			setItems((arr) => [created, ...arr]);
 			setSelectedId(created.id);
 			setCurrent(JSON.parse(JSON.stringify(created)) as GroundTruthItem);
-			qaBaseline.current = { q: created.question, a: created.answer };
+			qaBaseline.current = {
+				q: getLastUserTurn(created),
+				a: getLastAgentTurn(created),
+			};
 			setLastSavedStateFp(stateSignature(created));
 			try {
 				logEvent("gtc.duplicate_rephrase", {
@@ -865,9 +752,8 @@ function useGroundTruth(): UseGroundTruth {
 		updateHistory,
 		addTurn,
 		deleteTurn,
-		regenerateAgentTurn,
-		generateAgentTurn,
-		runAgentTurn,
+		updateContextEntries,
+		updateExpectedTools,
 		saving,
 		save,
 		canApprove,
diff --git a/frontend/src/hooks/useReferencesEditor.ts b/frontend/src/hooks/useReferencesEditor.ts
index a3cbba7..d638a3c 100644
--- a/frontend/src/hooks/useReferencesEditor.ts
+++ b/frontend/src/hooks/useReferencesEditor.ts
@@ -1,5 +1,9 @@
 import { useCallback, useRef } from "react";
 import type { GroundTruthItem, Reference } from "../models/groundTruth";
+import {
+	getItemReferences,
+	withUpdatedReferences,
+} from "../models/groundTruth";
 import { dedupeReferences } from "../models/gtHelpers";
 import { nowIso } from "../models/utils";
 
@@ -26,12 +30,11 @@ export function useReferencesEditor(options: {
 		(refId: string, patch: Partial<Reference>) => {
 			setCurrent((prev) => {
 				if (!prev) return prev;
-				return {
-					...prev,
-					references: prev.references.map((r) =>
-						r.id === refId ? { ...r, ...patch } : r,
-					),
-				};
+				const refs = getItemReferences(prev);
+				const updated = refs.map((r) =>
+					r.id === refId ? { ...r, ...patch } : r,
+				);
+				return withUpdatedReferences(prev, updated);
 			});
 		},
 		[setCurrent],
@@ -41,10 +44,8 @@ export function useReferencesEditor(options: {
 		(chosen: Reference[]) => {
 			setCurrent((prev) => {
 				if (!prev) return prev;
-				return {
-					...prev,
-					references: dedupeReferences(prev.references, chosen),
-				};
+				const refs = getItemReferences(prev);
+				return withUpdatedReferences(prev, dedupeReferences(refs, chosen));
 			});
 		},
 		[setCurrent],
@@ -54,19 +55,21 @@ export function useReferencesEditor(options: {
 		(refId: string, onUndoRegister: UndoRegistrar) => {
 			setCurrent((prev) => {
 				if (!prev) return prev;
-				const idx = prev.references.findIndex((r) => r.id === refId);
+				const refs = getItemReferences(prev);
+				const idx = refs.findIndex((r) => r.id === refId);
 				if (idx < 0) return prev;
-				const ref = prev.references[idx];
-				const nextRefs = prev.references.filter((_, i) => i !== idx);
+				const ref = refs[idx];
+				const nextRefs = refs.filter((_, i) => i !== idx);
 				const doUndo = () => {
 					setCurrent((p2) => {
 						if (!p2) return p2;
-						const present = p2.references.some((r) => r.id === ref.id);
+						const currentRefs = getItemReferences(p2);
+						const present = currentRefs.some((r) => r.id === ref.id);
 						if (present) return p2;
-						const arr = [...p2.references];
+						const arr = [...currentRefs];
 						const insertAt = Math.min(idx, arr.length);
 						arr.splice(insertAt, 0, ref);
-						return { ...p2, references: arr };
+						return withUpdatedReferences(p2, arr);
 					});
 					if (undoTimer.current) window.clearTimeout(undoTimer.current);
 				};
@@ -76,7 +79,7 @@ export function useReferencesEditor(options: {
 					() => {},
 					8000,
 				) as unknown as number;
-				return { ...prev, references: nextRefs };
+				return withUpdatedReferences(prev, nextRefs);
 			});
 		},
 		[setCurrent],
diff --git a/frontend/src/hooks/useReferencesSearch.ts b/frontend/src/hooks/useReferencesSearch.ts
index 997b3ab..73b5341 100644
--- a/frontend/src/hooks/useReferencesSearch.ts
+++ b/frontend/src/hooks/useReferencesSearch.ts
@@ -1,5 +1,5 @@
-import { useCallback, useState } from "react";
-import DEMO_MODE from "../config/demo";
+import { useCallback, useEffect, useRef, useState } from "react";
+import { shouldUseDemoProvider } from "../config/demo";
 import type { Reference } from "../models/groundTruth";
 import { mockAiSearch, searchReferences } from "../services/search";
 
@@ -19,25 +19,52 @@ export function useReferencesSearch(options: {
 	const [query, setQuery] = useState("");
 	const [searching, setSearching] = useState(false);
 	const [searchResults, setSearchResults] = useState<Reference[]>([]);
+	const activeControllerRef = useRef<AbortController | null>(null);
+	const requestIdRef = useRef(0);
+
+	useEffect(() => {
+		return () => {
+			activeControllerRef.current?.abort();
+		};
+	}, []);
 
 	const runSearch = useCallback(async () => {
 		const q = (query || getSeedQuery() || "").trim();
+		activeControllerRef.current?.abort();
+
 		if (!q) {
+			setSearching(false);
 			setSearchResults([]);
 			return;
 		}
+
+		const controller = new AbortController();
+		activeControllerRef.current = controller;
+		const requestId = ++requestIdRef.current;
 		setSearching(true);
+
 		try {
-			const results = DEMO_MODE
-				? await mockAiSearch(q)
-				: await searchReferences(q, 10);
+			const results = shouldUseDemoProvider()
+				? await mockAiSearch(q, controller.signal)
+				: await searchReferences(q, 10, controller.signal);
+			if (requestId !== requestIdRef.current || controller.signal.aborted)
+				return;
 			setSearchResults(results);
+		} catch (error) {
+			if (controller.signal.aborted) return;
+			throw error;
 		} finally {
-			setSearching(false);
+			if (requestId === requestIdRef.current && !controller.signal.aborted) {
+				setSearching(false);
+			}
 		}
 	}, [query, getSeedQuery]);
 
-	const clearResults = useCallback(() => setSearchResults([]), []);
+	const clearResults = useCallback(() => {
+		activeControllerRef.current?.abort();
+		setSearching(false);
+		setSearchResults([]);
+	}, []);
 
 	return { query, setQuery, searching, searchResults, runSearch, clearResults };
 }
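Reviewer note on the `useReferencesSearch.ts` change above: the hook combines two guards, aborting the previous request and comparing a monotonically increasing request ID, so that a slow earlier search cannot overwrite newer results. A minimal sketch of that pattern outside React follows. `LatestRequestGuard`, `RequestToken`, and `searchLatest` are hypothetical names used for illustration only and are not part of this diff.

```typescript
// Hypothetical helper (not in the diff): tracks the most recent request.
type RequestToken = { id: number; signal: AbortSignal };

class LatestRequestGuard {
	private controller: AbortController | null = null;
	private latestId = 0;

	/** Abort any in-flight request and issue a token for the new one. */
	begin(): RequestToken {
		this.controller?.abort();
		this.controller = new AbortController();
		this.latestId += 1;
		return { id: this.latestId, signal: this.controller.signal };
	}

	/** True only for the most recent token, and only while it is not aborted. */
	isCurrent(token: RequestToken): boolean {
		return token.id === this.latestId && !token.signal.aborted;
	}

	/** Abort the in-flight request without starting a new one. */
	cancel(): void {
		this.controller?.abort();
	}
}

// Usage sketch: `search` and `apply` are caller-supplied (assumptions).
async function searchLatest(
	guard: LatestRequestGuard,
	search: (q: string, signal: AbortSignal) => Promise<string[]>,
	q: string,
	apply: (results: string[]) => void,
): Promise<void> {
	const token = guard.begin();
	try {
		const results = await search(q, token.signal);
		// Drop results that belong to a superseded or cancelled request.
		if (guard.isCurrent(token)) apply(results);
	} catch (error) {
		if (token.signal.aborted) return; // cancellation is not an error
		throw error;
	}
}
```

Checking both the ID and the abort flag is slightly redundant, but it also covers a search function that ignores the signal and resolves anyway.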
diff --git a/frontend/src/hooks/useTagGlossary.ts b/frontend/src/hooks/useTagGlossary.ts
index e9b8754..025f837 100644
--- a/frontend/src/hooks/useTagGlossary.ts
+++ b/frontend/src/hooks/useTagGlossary.ts
@@ -1,12 +1,14 @@
 import { useEffect, useSyncExternalStore } from "react";
 import type { components } from "../api/generated";
+import {
+	buildTagGlossaryMap,
+	clearTagGlossaryCache,
+	fetchTagGlossary,
+	type TagGlossary,
+} from "../services/tags";
 
 type GlossaryResponse = components["schemas"]["GlossaryResponse"];
 
-export interface TagGlossary {
-	[tagKey: string]: string | undefined;
-}
-
 // Detect test environment
 const isTestEnvironment =
 	typeof process !== "undefined" && process.env.NODE_ENV === "test";
@@ -62,7 +64,7 @@ class GlossaryStore {
 		}
 	}
 
-	async fetch() {
+	async fetch(force = false) {
 		// Skip fetching in test environment
 		if (isTestEnvironment) {
 			this.loading = false;
@@ -80,31 +82,15 @@ class GlossaryStore {
 				this.loading = true;
 				this.notify();
 
-				const response = await fetch("/v1/tags/glossary");
-				if (!response.ok) {
-					throw new Error("Failed to fetch tag glossary");
-				}
-				const data: GlossaryResponse = await response.json();
-
-				// Store raw glossary response
-				this.rawGlossary = data;
-
-				// Build lookup map: tag key -> description
-				const glossaryMap: TagGlossary = {};
-				for (const group of data.groups || []) {
-					for (const tag of group.tags || []) {
-						if (tag.key && tag.description) {
-							glossaryMap[tag.key] = tag.description;
-						}
-					}
-				}
-
-				this.glossary = glossaryMap;
+				const glossary = await fetchTagGlossary({ force });
+				this.rawGlossary = glossary;
+				this.glossary = buildTagGlossaryMap(glossary);
 				this.error = null;
-			} catch (err) {
-				this.error = err instanceof Error ? err : new Error(String(err));
+			} catch (error) {
+				this.error = error instanceof Error ? error : new Error(String(error));
 			} finally {
 				this.loading = false;
+				this.fetchPromise = null;
 				this.notify();
 			}
 		})();
@@ -113,6 +99,7 @@ class GlossaryStore {
 	}
 
 	clear() {
+		clearTagGlossaryCache();
 		this.glossary = {};
 		this.rawGlossary = null;
 		this.loading = false;
@@ -123,7 +110,8 @@ class GlossaryStore {
 
 	refresh() {
 		this.fetchPromise = null;
-		return this.fetch();
+		clearTagGlossaryCache();
+		return this.fetch(true);
 	}
 
 	setGlossary(glossary: TagGlossary) {
diff --git a/frontend/src/hooks/useTags.ts b/frontend/src/hooks/useTags.ts
index cfd5a11..16c2214 100644
--- a/frontend/src/hooks/useTags.ts
+++ b/frontend/src/hooks/useTags.ts
@@ -1,46 +1,62 @@
-import { useCallback, useEffect, useState } from "react";
-import { fetchAvailableTags } from "../services/tags";
+import { useCallback, useEffect, useSyncExternalStore } from "react";
+import {
+	ensureTagCached,
+	fetchTagsWithComputed,
+	getTagMetadataSnapshot,
+	subscribeToTagMetadata,
+} from "../services/tags";
 
-function useTags() {
-	const [allTags, setAllTags] = useState<string[]>([]);
-	const [loading, setLoading] = useState<boolean>(false);
-	const [error, setError] = useState<string | null>(null);
+interface UseTagsOptions {
+	enabled?: boolean;
+}
 
-	const refresh = useCallback(async () => {
-		setLoading(true);
-		setError(null);
-		try {
-			const tags = await fetchAvailableTags();
-			if (Array.isArray(tags)) setAllTags(tags);
-		} catch (e) {
-			const msg = e instanceof Error ? e.message : String(e);
-			setError(msg);
-		} finally {
-			setLoading(false);
+function useTags(options?: UseTagsOptions) {
+	const enabled = options?.enabled ?? true;
+	const state = useSyncExternalStore(
+		subscribeToTagMetadata,
+		getTagMetadataSnapshot,
+	);
+
+	useEffect(() => {
+		if (!enabled) {
+			return;
 		}
+
+		fetchTagsWithComputed().catch(() => {
+			// Service snapshot already captures the error state.
+		});
+	}, [enabled]);
+
+	const refresh = useCallback(async () => {
+		await fetchTagsWithComputed({ force: true });
 	}, []);
 
-	useEffect(() => {
-		refresh();
-	}, [refresh]);
-
-	const ensureTag = useCallback(async (tag: string) => {
-		const t = (tag || "").trim();
-		if (!t) return;
-		// Optimistic add locally; do not POST yet — defer until item save
-		setAllTags((prev) => (prev.includes(t) ? prev : [...prev, t].sort()));
+	const ensureTag = useCallback((tag: string) => {
+		ensureTagCached(tag);
 	}, []);
 
 	const filter = useCallback(
 		(q: string) => {
-			const s = (q || "").toLowerCase();
-			if (!s) return allTags;
-			return allTags.filter((t) => t.toLowerCase().includes(s));
+			const search = (q || "").toLowerCase();
+			if (!search) {
+				return state.allTags;
+			}
+
+			return state.allTags.filter((tag) => tag.toLowerCase().includes(search));
 		},
-		[allTags],
+		[state.allTags],
 	);
 
-	return { allTags, loading, error, refresh, ensureTag, filter };
+	return {
+		allTags: state.allTags,
+		manualTags: state.manualTags,
+		computedTags: state.computedTags,
+		loading: state.loading,
+		error: state.error?.message ?? null,
+		refresh,
+		ensureTag,
+		filter,
+	};
 }
 
 export default useTags;
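Reviewer note on the `useTags.ts` change above: the hook now reads from a module-level store through `useSyncExternalStore`, which expects a `subscribe` function that returns an unsubscribe callback and a `getSnapshot` function that returns the same object reference until the store changes. The real implementations live in `../services/tags` and are not shown in this diff, so the sketch below is only an assumption about their general shape. Every name in it is hypothetical, and the trim-and-sort behavior simply mirrors the removed `ensureTag` logic.

```typescript
// Hypothetical store shape; the real one in services/tags may differ.
type TagSnapshot = {
	allTags: string[];
	loading: boolean;
	error: Error | null;
};

let snapshot: TagSnapshot = { allTags: [], loading: false, error: null };
const listeners = new Set<() => void>();

/** Register a change listener; returns an unsubscribe function. */
function subscribe(listener: () => void): () => void {
	listeners.add(listener);
	return () => {
		listeners.delete(listener);
	};
}

/** Return the current snapshot; the reference is stable between changes. */
function getSnapshot(): TagSnapshot {
	return snapshot;
}

/** Replace the snapshot with a new object and notify every listener. */
function setSnapshot(next: TagSnapshot): void {
	snapshot = next;
	for (const listener of listeners) listener();
}

/** Add a tag locally if it is non-empty and not already present. */
function ensureTag(tag: string): void {
	const t = tag.trim();
	if (!t || snapshot.allTags.includes(t)) return;
	setSnapshot({ ...snapshot, allTags: [...snapshot.allTags, t].sort() });
}
```

Because a no-op `ensureTag` call leaves the snapshot untouched, subscribers are not notified and React would not re-render for duplicates.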
diff --git a/frontend/src/models/demoData.ts b/frontend/src/models/demoData.ts
index 44f5ae6..07d7513 100644
--- a/frontend/src/models/demoData.ts
+++ b/frontend/src/models/demoData.ts
@@ -4,70 +4,493 @@ import { JsonProvider } from "./provider";
 
 export const DEMO_JSON: GroundTruthItem[] = [
 	{
-		id: "GT-0001",
+		id: "demo-data-overage",
 		providerId: "json",
-		question: "How do I reset my password in the application?",
+		question: "CX IS USING TOO MUCH DATA AND WANTS TO KNOW WHY",
 		answer:
-			"To reset your password, navigate to User Settings > Security and click 'Reset Password'. You will receive an email with a link to create a new password.",
-		references: [
+			"The RCA shows the line exceeded the 50 GB plan cap after extended streaming and hotspot activity stayed on cellular data instead of Wi-Fi.",
+		history: [
 			{
-				id: "r1",
-				title: "User Guide - Account Security",
-				url: "https://example.com/docs/account-security",
-				snippet:
-					"Click the Reset Password button in the Security tab to initiate the recovery flow.",
-				visitedAt: null,
-				keyParagraph: "",
+				role: "user",
+				content: "CX IS USING TOO MUCH DATA AND WANTS TO KNOW WHY",
 			},
 			{
-				id: "r2",
-				title: "Troubleshooting Login Issues",
-				url: "https://example.com/docs/troubleshooting-login",
-				snippet:
-					"If you forgot your password, use the reset feature in User Settings.",
-				visitedAt: null,
-				keyParagraph: "",
+				role: "assistant",
+				content:
+					"The usage spike came from long streaming sessions and tethering while the handset was off Wi-Fi. No outage or provisioning defect was detected, so the best resolution is to explain the plan cap and coach the customer toward Wi-Fi-heavy usage.",
+			},
+			{
+				role: "assistant",
+				content:
+					"Root Cause Analysis: The line exceeded its 50 GB plan cap due to extended video streaming (≈38 GB) and mobile hotspot tethering (≈9 GB) over cellular while Wi-Fi was available but not connected. No network fault or provisioning error contributed.",
+			},
+		],
+		contextEntries: [
+			{ key: "impacted_device_type", value: "MSISDN" },
+			{ key: "metric_name", value: "user feedback" },
+			{ key: "resolution", value: "CX WAS NOT ON WIFI AND OVERLY USING DATA" },
+		],
+		toolCalls: [
+			{
+				id: "tool-001",
+				name: "get_location",
+				callType: "tool",
+				stepNumber: 1,
+				arguments: { msisdn: "[REDACTED_MSISDN_001]", context: null },
+				response: {
+					result: {
+						response: {
+							items: [{ valueObject: { location: { wifiConnected: false } } }],
+						},
+					},
+					executionTimeSeconds: 2.41,
+				},
+			},
+			{
+				id: "tool-002",
+				name: "get_plan_usage",
+				callType: "tool",
+				stepNumber: 2,
+				arguments: { msisdn: "[REDACTED_MSISDN_001]", context: null },
+				response: {
+					result: {
+						response: {
+							items: [{ valueObject: { planLimitGb: 50, usageGb: 63 } }],
+						},
+					},
+					executionTimeSeconds: 1.83,
+				},
+			},
+			{
+				id: "tool-003",
+				name: "Billing_agent",
+				callType: "tool",
+				stepNumber: 3,
+				arguments: { msisdn: "[REDACTED_MSISDN_001]", context: null },
+				response: {
+					result: {
+						response: {
+							summary: "Overage charges align with the plan cap breach.",
+						},
+					},
+					executionTimeSeconds: 1.17,
+				},
 			},
 		],
+		expectedTools: {
+			required: [{ name: "get_plan_usage" }, { name: "Billing_agent" }],
+		},
+		feedback: [
+			{
+				source: "trace-export-ratings",
+				values: {
+					"The recommended resolution was correct and appropriate": 2,
+					"The explanation and investigation areas were relevant to the issue": 2,
+				},
+			},
+		],
+		metadata: {
+			sourceFormat: "trace-export",
+			datasetTheme: "customer-feedback",
+		},
+		traceIds: {
+			traceId: "demo-trace-001",
+			conversationId: "demo-cid-001",
+		},
+		tracePayload: {
+			resolution: "CX WAS NOT ON WIFI AND OVERLY USING DATA",
+			impacted_device_type: "MSISDN",
+		},
+		plugins: {
+			"rag-compat": {
+				kind: "rag-compat",
+				version: "1.0",
+				data: {
+					retrievals: {
+						"tool-002": {
+							candidates: [
+								{
+									url: "https://telco.example.com/help/data-usage/check-usage",
+									title: "Check mobile data usage",
+									chunk:
+										"Compare the current-cycle data total with the plan cap before treating the usage as anomalous.",
+									toolCallId: "tool-002",
+								},
+							],
+						},
+						"tool-003": {
+							candidates: [
+								{
+									url: "https://telco.example.com/help/data-usage/wifi-assist",
+									title: "Reduce cellular usage with Wi-Fi",
+									chunk:
+										"Streaming over cellular is a common source of overage charges when Wi-Fi is available.",
+									toolCallId: "tool-003",
+								},
+							],
+						},
+					},
+				},
+			},
+		},
 		status: "draft",
 		deleted: false,
-		tags: ["account", "security", "password"],
-		comment: "Example of a standard procedural question.",
+		tags: ["data-usage", "billing", "wifi"],
+		comment: "Trace-style draft item with tool evidence and curator ratings.",
+		curationInstructions: `
+### Curation Guidelines (Customer Feedback)
+
+- Keep the customer symptom intact before refining the RCA.
+- Tie the answer back to the specific tool evidence.
+- If the trace shows no defect, say so clearly.
+`,
+		datasetName: "customer-feedback",
+	},
+	{
+		id: "demo-roaming-pass-timing",
+		providerId: "json",
+		question:
+			"CUSTOMER WAS CHARGED ROAMING FEES EVEN THOUGH THEY BOUGHT A PASS",
+		answer:
+			"The travel pass activated after the first charged roaming session, so the early usage billed at standard rates and later usage correctly switched to the pass.",
+		history: [
+			{
+				role: "user",
+				content:
+					"CUSTOMER WAS CHARGED ROAMING FEES EVEN THOUGH THEY BOUGHT A PASS",
+			},
+			{
+				role: "assistant",
+				content:
+					"The billing timeline is consistent with the pass order. The pass was not active for the first day of travel, so those sessions remained chargeable, and no network defect was present.",
+			},
+		],
+		contextEntries: [
+			{ key: "impacted_device_type", value: "MSISDN" },
+			{ key: "metric_name", value: "user feedback" },
+			{
+				key: "resolution",
+				value: "ROAMING PASS ACTIVATED AFTER THE FIRST CHARGED SESSION",
+			},
+		],
+		toolCalls: [
+			{
+				id: "tool-201",
+				name: "get_roaming_usage",
+				callType: "tool",
+				stepNumber: 1,
+				arguments: { msisdn: "[REDACTED_MSISDN_003]", context: null },
+				response: {
+					result: {
+						response: {
+							items: [
+								{ valueObject: { chargedSessions: 3, passCoveredSessions: 9 } },
+							],
+						},
+					},
+					executionTimeSeconds: 1.58,
+				},
+			},
+			{
+				id: "tool-203",
+				name: "Billing_agent",
+				callType: "tool",
+				stepNumber: 2,
+				arguments: { msisdn: "[REDACTED_MSISDN_003]", context: null },
+				response: {
+					result: {
+						response: {
+							summary: "Billing timeline and pass activation are aligned.",
+						},
+					},
+					executionTimeSeconds: 1.11,
+				},
+			},
+		],
+		expectedTools: {
+			required: [{ name: "get_roaming_usage" }, { name: "Billing_agent" }],
+		},
+		feedback: [
+			{
+				source: "trace-export-ratings",
+				values: {
+					"The recommended resolution was correct and appropriate": 2,
+					"The RCA included all key information (nothing important was missed)": 2,
+				},
+			},
+		],
+		metadata: {
+			sourceFormat: "trace-export",
+			datasetTheme: "network-diagnostics",
+		},
+		traceIds: {
+			traceId: "demo-trace-003",
+			conversationId: "demo-cid-003",
+		},
+		plugins: {
+			"rag-compat": {
+				kind: "rag-compat",
+				version: "1.0",
+				data: {
+					retrievals: {
+						"tool-201": {
+							candidates: [
+								{
+									url: "https://telco.example.com/help/roaming/travel-pass-timing",
+									title: "Travel pass activation timing",
+									chunk:
+										"Travel passes only apply after activation and do not retroactively cover earlier roaming sessions.",
+									toolCallId: "tool-201",
+								},
+							],
+						},
+					},
+				},
+			},
+		},
+		status: "approved",
+		deleted: false,
+		tags: ["roaming", "billing", "approved"],
+		comment: "Approved trace-style example for billing RCA review.",
 		curationInstructions: `
-### Curation Guidelines (Account Security)
+### Curation Guidelines (Network Diagnostics)
 
-- Ensure the answer is direct and step-by-step.
-- Verify that the 'Reset Password' button location is accurately described.
-- Link to the official User Guide whenever possible.
+- Distinguish billing timing from device or network faults.
+- State when the evidence rules out an outage.
+- Keep the root cause specific and checkable.
 `,
+		datasetName: "network-diagnostics",
+	},
+	{
+		id: "demo-stadium-congestion",
+		providerId: "json",
+		question:
+			"CUSTOMER THINKS THERE WAS AN OUTAGE WHEN DATA SLOWED DOWN AT A STADIUM",
+		answer:
+			"The slowdown was caused by short-lived cell congestion during a high-density event, not by a persistent account or device problem.",
+		history: [
+			{
+				role: "user",
+				content:
+					"CUSTOMER THINKS THERE WAS AN OUTAGE WHEN DATA SLOWED DOWN AT A STADIUM",
+			},
+			{
+				role: "assistant",
+				content:
+					"Nearby cell sectors were saturated during the event window and recovered later the same evening. The line and handset remained healthy, so this is best handled as a temporary congestion explanation rather than a defect ticket.",
+			},
+		],
+		contextEntries: [
+			{ key: "impacted_device_type", value: "MSISDN" },
+			{ key: "metric_name", value: "user feedback" },
+			{
+				key: "resolution",
+				value: "SHORT-LIVED CELL CONGESTION DURING A HIGH-DENSITY EVENT",
+			},
+		],
+		toolCalls: [
+			{
+				id: "tool-401",
+				name: "get_location",
+				callType: "tool",
+				stepNumber: 1,
+				arguments: { msisdn: "[REDACTED_MSISDN_005]", context: null },
+				response: {
+					result: {
+						response: {
+							items: [{ valueObject: { cellSector: "STADIUM-12" } }],
+						},
+					},
+					executionTimeSeconds: 1.51,
+				},
+			},
+			{
+				id: "tool-402",
+				name: "qtm_cellsector_ref_query",
+				callType: "tool",
+				stepNumber: 2,
+				arguments: { sector: "STADIUM-12", hours: 12 },
+				response: {
+					result: {
+						response: {
+							items: [
+								{ valueObject: { congestionEvent: true, peakUsers: 1840 } },
+							],
+						},
+					},
+					executionTimeSeconds: 1.92,
+				},
+			},
+		],
+		feedback: [
+			{
+				source: "trace-export-ratings",
+				values: {
+					"The explanation and investigation areas were relevant to the issue": 1,
+				},
+			},
+		],
+		metadata: {
+			sourceFormat: "trace-export",
+			datasetTheme: "network-diagnostics",
+		},
+		traceIds: {
+			traceId: "demo-trace-005",
+			conversationId: "demo-cid-005",
+		},
+		status: "draft",
+		deleted: true,
+		tags: ["congestion", "event", "deleted"],
+		comment:
+			"Deleted sample keeps restore flows visible with real trace-like evidence.",
+		datasetName: "network-diagnostics",
 	},
 	{
-		id: "GT-0002",
+		id: "demo-kb-retrieval",
 		providerId: "json",
-		question: "What formats are supported for data export?",
+		question:
+			"What is our refund policy for services interrupted during a network outage?",
 		answer:
-			"The application supports exporting data in CSV, JSON, and XML formats. You can select your preferred format from the Export dialog.",
-		references: [
+			"Customers affected by confirmed outages lasting more than 4 hours are eligible for a pro-rated service credit. Credits are applied automatically within 2 billing cycles for outages flagged in the NOC system.",
+		history: [
+			{
+				role: "user",
+				content:
+					"What is our refund policy for services interrupted during a network outage?",
+			},
+			{
+				role: "agent",
+				content:
+					"Based on our internal knowledge base, customers affected by confirmed outages lasting more than 4 hours are eligible for a pro-rated service credit. The credit is calculated based on the duration of the outage relative to the billing period and is applied automatically within 2 billing cycles for outages flagged in the NOC system. For outages under 4 hours, agents may issue a goodwill credit at their discretion up to $25.",
+			},
+		],
+		contextEntries: [
+			{ key: "channel", value: "internal-agent-assist" },
+			{ key: "category", value: "billing-policy" },
+		],
+		toolCalls: [
+			{
+				id: "tool-kb-search-001",
+				name: "search_documents",
+				callType: "tool",
+				stepNumber: 1,
+				arguments: {
+					query: "refund policy network outage service credit",
+					top_k: 5,
+				},
+				response: {
+					result: {
+						documents: [
+							{
+								title: "Outage Credit Policy v3.2",
+								score: 0.94,
+							},
+							{
+								title: "Goodwill Credit Guidelines",
+								score: 0.87,
+							},
+						],
+					},
+					executionTimeSeconds: 0.82,
+				},
+			},
 			{
-				id: "r3",
-				title: "Data Export Capabilities",
-				url: "https://example.com/docs/data-export",
-				snippet:
-					"Supported formats include CSV for spreadsheets, JSON for web apps, and XML for legacy systems.",
-				visitedAt: null,
-				keyParagraph: "",
+				id: "tool-kb-search-002",
+				name: "search_documents",
+				callType: "tool",
+				stepNumber: 2,
+				arguments: {
+					query: "NOC outage credit automatic billing adjustment",
+					top_k: 3,
+				},
+				response: {
+					result: {
+						documents: [
+							{
+								title: "NOC-to-Billing Automation Runbook",
+								score: 0.91,
+							},
+						],
+					},
+					executionTimeSeconds: 0.64,
+				},
 			},
 		],
+		expectedTools: {
+			required: [{ name: "search_documents" }],
+		},
+		feedback: [
+			{
+				source: "curator-review",
+				values: {
+					"Answer is grounded in retrieved documents": 1,
+					"All relevant documents were retrieved": 2,
+				},
+			},
+		],
+		metadata: {
+			sourceFormat: "agent-trace",
+			datasetTheme: "knowledge-retrieval",
+		},
+		traceIds: {
+			traceId: "demo-trace-kb-001",
+			conversationId: "demo-cid-kb-001",
+		},
+		plugins: {
+			"rag-compat": {
+				kind: "rag-compat",
+				version: "1.0",
+				data: {
+					retrievals: {
+						"tool-kb-search-001": {
+							candidates: [
+								{
+									url: "https://kb.example.com/policies/outage-credit-v3.2",
+									title: "Outage Credit Policy v3.2",
+									chunk:
+										"Customers on postpaid plans who experience a confirmed outage exceeding 4 continuous hours are entitled to a pro-rated service credit equal to the outage duration divided by the billing period.",
+									toolCallId: "tool-kb-search-001",
+									relevance: "relevant",
+								},
+								{
+									url: "https://kb.example.com/policies/goodwill-credit",
+									title: "Goodwill Credit Guidelines",
+									chunk:
+										"For outages under the 4-hour threshold, agents may issue a discretionary goodwill credit of up to $25 per incident without supervisor approval.",
+									toolCallId: "tool-kb-search-001",
+									relevance: "partially_relevant",
+								},
+							],
+						},
+						"tool-kb-search-002": {
+							candidates: [
+								{
+									url: "https://kb.example.com/runbooks/noc-billing-automation",
+									title: "NOC-to-Billing Automation Runbook",
+									chunk:
+										"Credits for NOC-flagged outages are applied automatically within 2 billing cycles. No manual agent action is required for confirmed outages in the NOC system.",
+									toolCallId: "tool-kb-search-002",
+									relevance: "relevant",
+								},
+							],
+						},
+					},
+				},
+			},
+		},
 		status: "draft",
 		deleted: false,
-		tags: ["export", "data", "formats"],
-		comment: undefined,
+		tags: ["retrieval", "policy", "billing"],
+		comment:
+			"Demo item showcasing retrieval tool calls with per-call reference management.",
 		curationInstructions: `
-### Curation Guidelines (Data Export)
+### Curation Guidelines (Knowledge Retrieval)
 
-- List all supported formats explicitly.
-- Mention where the Export dialog is located if not obvious.
+- Verify the answer is grounded in the retrieved documents.
+- Check that all relevant references are associated with the correct tool call.
+- Mark references as relevant, partially relevant, or not relevant.
 `,
+		datasetName: "knowledge-retrieval",
 	},
 ];
 
diff --git a/frontend/src/models/groundTruth.ts b/frontend/src/models/groundTruth.ts
index 9a1d332..3592650 100644
--- a/frontend/src/models/groundTruth.ts
+++ b/frontend/src/models/groundTruth.ts
@@ -1,5 +1,207 @@
 // Domain models and constants for Ground Truth items
 
+// ---------------------------------------------------------------------------
+// Generic schema types (aligned with gt_schema_v5_generic.py and generated API)
+// ---------------------------------------------------------------------------
+
+/** A key-value pair of context provided to the agent scenario. */
+export type ContextEntry = {
+	key: string;
+	value: unknown;
+};
+
+/** A record of a single tool or sub-agent call made during execution. */
+export type ToolCallRecord = {
+	id: string;
+	name: string;
+	callType: "tool" | "subagent";
+	arguments?: Record<string, unknown>;
+	agent?: string | null;
+	stepNumber?: number | null;
+	parallelGroup?: string | null;
+	parentCallId?: string | null;
+	response?: unknown;
+};
+
+/** A single tool expectation within an expected-tools group. */
+export type ToolExpectation = {
+	name: string;
+	arguments?: Record<string, unknown> | string | null;
+};
+
+/**
+ * Item-level expected tool specification.
+ * Tools are implicitly allowed unless listed here.
+ */
+export type ExpectedTools = {
+	required?: ToolExpectation[];
+	optional?: ToolExpectation[];
+	notNeeded?: ToolExpectation[];
+};
+
+/** Curator or automated feedback attached to an item. */
+export type FeedbackEntry = {
+	source: string;
+	values?: Record<string, unknown>;
+};
+
+/** An opaque plugin payload stored under a named slot. */
+export type PluginPayload = {
+	kind: string;
+	version: string;
+	data?: Record<string, unknown>;
+};
+
+/**
+ * A single retrieval result that can be associated with a specific tool call.
+ * Supports per-tool-call ownership instead of flat top-level references,
+ * and preserves the raw search payload alongside normalized fields.
+ */
+export type RetrievalCandidate = {
+	url: string;
+	title?: string;
+	chunk?: string;
+	rawPayload?: Record<string, unknown>;
+	relevance?: "relevant" | "partially_relevant" | "not_relevant";
+	toolCallId?: string;
+};
+
+// ---------------------------------------------------------------------------
+// Per-call retrieval helpers (Phase 6 — retrieval normalization)
+//
+// References are stored in plugins["rag-compat"].data.retrievals per tool
+// call.  The helpers below provide flat Reference[] access for UI
+// components that still consume the legacy Reference shape.
+// ---------------------------------------------------------------------------
+
+const _RAG_COMPAT_KEY = "rag-compat";
+const _UNASSOCIATED_KEY = "_unassociated";
+
+/** Per-call retrieval bucket as stored in plugin data. */
+type RetrievalBucket = {
+	candidates: Array<{
+		url: string;
+		title?: string;
+		chunk?: string;
+		rawPayload?: Record<string, unknown>;
+		relevance?: string;
+		toolCallId?: string | null;
+		messageIndex?: number;
+		turnId?: string;
+		keyParagraph?: string;
+		bonus?: boolean;
+		visitedAt?: string | null;
+	}>;
+};
+
+/** Typed shorthand for the retrievals dict inside rag-compat plugin data. */
+type RetrievalsMap = Record<string, RetrievalBucket>;
+
+/**
+ * Read the per-call retrievals map from plugin data.
+ * Returns `undefined` when no per-call state exists.
+ */
+export function getRetrievalsMap(
+	item: Pick<GroundTruthItem, "plugins">,
+): RetrievalsMap | undefined {
+	const data = item.plugins?.[_RAG_COMPAT_KEY]?.data;
+	if (!data) return undefined;
+	const r = data.retrievals;
+	if (r && typeof r === "object" && !Array.isArray(r)) {
+		return r as RetrievalsMap;
+	}
+	return undefined;
+}
+
+/**
+ * Extract a flat Reference[] from per-call retrieval state.
+ *
+ * Read path: returns per-call candidates when present, mapped to the legacy
+ * Reference shape.  Falls back to an empty array when no per-call state
+ * exists (caller should provide legacy references separately if needed).
+ */
+export function getItemReferences(item: GroundTruthItem): Reference[] {
+	const retrievals = getRetrievalsMap(item);
+	if (!retrievals) return [];
+	const history = ensureConversationTurnIdentity(item.history);
+	const indexByTurnId = getTurnIndexById(history);
+
+	const refs: Reference[] = [];
+	let refIndex = 0;
+	for (const [toolCallId, bucket] of Object.entries(retrievals)) {
+		if (!bucket?.candidates) continue;
+		for (const c of bucket.candidates) {
+			const storedTurnId = c.turnId;
+			const resolvedMessageIndex =
+				storedTurnId && indexByTurnId.has(storedTurnId)
+					? indexByTurnId.get(storedTurnId)
+					: c.messageIndex;
+			const resolvedTurnId =
+				storedTurnId ||
+				(typeof resolvedMessageIndex === "number"
+					? history[resolvedMessageIndex]?.turnId
+					: undefined);
+			refs.push({
+				id: `ref_${refIndex++}`,
+				title: c.title,
+				url: c.url,
+				snippet: c.chunk,
+				visitedAt: c.visitedAt ?? null,
+				keyParagraph: c.keyParagraph,
+				bonus: c.bonus ?? false,
+				messageIndex: resolvedMessageIndex,
+				turnId: resolvedTurnId,
+				toolCallId: toolCallId !== _UNASSOCIATED_KEY ? toolCallId : undefined,
+			});
+		}
+	}
+	return refs;
+}
+
+/**
+ * Return a new item with references written into per-call plugin state.
+ * Groups references by `toolCallId` (falling back to _unassociated).
+ * Immutable — returns a new object.
+ */
+export function withUpdatedReferences(
+	item: GroundTruthItem,
+	refs: Reference[],
+): GroundTruthItem {
+	const retrievals: RetrievalsMap = {};
+	for (const ref of refs) {
+		const key = ref.toolCallId || _UNASSOCIATED_KEY;
+		if (!retrievals[key]) {
+			retrievals[key] = { candidates: [] };
+		}
+		retrievals[key].candidates.push({
+			url: ref.url,
+			title: ref.title,
+			chunk: ref.snippet,
+			relevance: undefined,
+			toolCallId: ref.toolCallId || undefined,
+			messageIndex: ref.turnId ? undefined : ref.messageIndex,
+			turnId: ref.turnId,
+			keyParagraph: ref.keyParagraph,
+			bonus: ref.bonus,
+			visitedAt: ref.visitedAt,
+		});
+	}
+
+	const plugins = { ...(item.plugins || {}) };
+	const existing = plugins[_RAG_COMPAT_KEY];
+	plugins[_RAG_COMPAT_KEY] = {
+		kind: _RAG_COMPAT_KEY,
+		version: existing?.version || "1.0",
+		data: { ...(existing?.data || {}), retrievals },
+	};
+
+	return { ...item, plugins };
+}
+
+// ---------------------------------------------------------------------------
+// Existing types kept for backward compat
+// ---------------------------------------------------------------------------
+
 export type ExpectedBehavior =
 	| "tool:search"
 	| "generation:answer"
@@ -8,9 +210,15 @@ export type ExpectedBehavior =
 	| "generation:out-of-domain";
 
 export type ConversationTurn = {
-	role: "user" | "agent";
+	/** Stable identity for canonical multi-turn editing state. */
+	turnId?: string;
+	/** Stable workflow-step identity when a turn maps to a durable step. */
+	stepId?: string;
+	/** Free-form role string. "user" marks the human turn; any other value is a non-user (agent/assistant) turn.
+	 *  Common values: "user", "agent", "assistant", "output-agent", "orchestrator-agent". */
+	role: string;
 	content: string;
-	/** Expected behavior(s) for this turn in the conversation (agent turns only) */
+	/** Expected behavior(s) for this turn in the conversation (agent turns only, legacy/compat) */
 	expectedBehavior?: ExpectedBehavior[];
 };
 
@@ -23,17 +231,98 @@ export type Reference = {
 	keyParagraph?: string;
 	// Mark as bonus (additional context)
 	bonus?: boolean;
-	// Which agent turn these refs belong to (optional)
+	// Which agent turn these refs belong to (optional, legacy association)
 	messageIndex?: number;
+	// Stable turn ownership for canonical multi-turn editing state.
+	turnId?: string;
+	// Which tool call these refs belong to (per-call retrieval state)
+	toolCallId?: string;
 };
 
+export function getTurnIndexById(
+	history?: ConversationTurn[],
+): Map<string, number> {
+	return new Map(
+		ensureConversationTurnIdentity(history)
+			.map((turn, index) =>
+				turn.turnId ? ([turn.turnId, index] as const) : null,
+			)
+			.filter((entry): entry is readonly [string, number] => entry !== null),
+	);
+}
+
+export function getReferenceMessageIndex(
+	ref: Pick<Reference, "messageIndex" | "turnId">,
+	history?: ConversationTurn[],
+): number | undefined {
+	if (ref.turnId) {
+		return getTurnIndexById(history).get(ref.turnId);
+	}
+	return ref.messageIndex;
+}
+
+function getReferenceChunkIdentityKey(
+	ref: Pick<Reference, "snippet" | "keyParagraph">,
+): string | null {
+	const snippet = ref.snippet?.trim();
+	const keyParagraph = ref.keyParagraph?.trim();
+	if (!snippet && !keyParagraph) {
+		return null;
+	}
+	return JSON.stringify([snippet ?? null, keyParagraph ?? null]);
+}
+
+export function getReferenceIdentityKey(
+	ref: Pick<
+		Reference,
+		| "url"
+		| "toolCallId"
+		| "turnId"
+		| "messageIndex"
+		| "snippet"
+		| "keyParagraph"
+	>,
+): string {
+	const ownerKey = ref.toolCallId ? `tool:${ref.toolCallId}` : "tool:none";
+	const turnKey = ref.turnId
+		? `turn:${ref.turnId}`
+		: ref.messageIndex !== undefined
+			? `index:${ref.messageIndex}`
+			: "index:none";
+	const chunkKey = getReferenceChunkIdentityKey(ref);
+	return chunkKey
+		? `${ownerKey}::${turnKey}::${ref.url}::chunk:${chunkKey}`
+		: `${ownerKey}::${turnKey}::${ref.url}`;
+}
+
 export type GroundTruthItem = {
 	id: string;
-	question: string;
-	answer: string;
-	// NEW: full conversation history for multi-turn support
+	// ---------------------------------------------------------------------------
+	// Generic schema fields (Phase 2+)
+	// ---------------------------------------------------------------------------
+	/** Conversation history. Free-form roles; "user" marks human turns. */
 	history?: ConversationTurn[];
-	references: Reference[];
+	/** Scenario identifier linking this item to an originating scenario. */
+	scenarioId?: string;
+	/** Context entries provided to the agent (key-value pairs). */
+	contextEntries?: ContextEntry[];
+	/** Tool call records captured during agent execution. */
+	toolCalls?: ToolCallRecord[];
+	/** Item-level tool expectations (required / optional / not-needed). */
+	expectedTools?: ExpectedTools;
+	/** Feedback entries from curators or automated systems. */
+	feedback?: FeedbackEntry[];
+	/** Arbitrary metadata dictionary for trace info and other extensions. */
+	metadata?: Record<string, unknown>;
+	/** Plugin-specific payloads keyed by plugin slot name. */
+	plugins?: Record<string, PluginPayload>;
+	/** Trace correlation IDs (e.g., conversationId, sessionId). */
+	traceIds?: Record<string, string> | null;
+	/** Full raw trace payload for evidence review. */
+	tracePayload?: Record<string, unknown>;
+	// ---------------------------------------------------------------------------
+	// Common lifecycle and metadata fields
+	// ---------------------------------------------------------------------------
 	status: "draft" | "approved" | "skipped" | "deleted";
 	providerId: string; // e.g., 'json'
 	deleted?: boolean; // soft delete flag (sidebar still shows it)
@@ -49,6 +338,10 @@ export type GroundTruthItem = {
 	datasetName?: string;
 	/** Optional storage bucket when sourced from API-backed provider. */
 	bucket?: string;
+	/** Legacy compatibility projection derived from history when absent. */
+	question?: string;
+	/** Legacy compatibility projection derived from history when absent. */
+	answer?: string;
 	/** ISO date string of the last review, when provided by the API. */
 	reviewedAt?: string | null;
 	/**
@@ -64,50 +357,145 @@ export type GroundTruthItem = {
 
 // Helper functions for multi-turn support
 
+const LEGACY_HOST_DELETE_GATES = [
+	"stored-data audit completed",
+	"caller audit completed",
+	"import/export verification completed",
+] as const;
+
+export type LegacyHostDeleteGate = (typeof LEGACY_HOST_DELETE_GATES)[number];
+
+export function getLegacyHostDeleteGates(): LegacyHostDeleteGate[] {
+	return [...LEGACY_HOST_DELETE_GATES];
+}
+
+export function createConversationTurn(args: {
+	role: string;
+	content: string;
+	turnId?: string;
+	stepId?: string;
+	expectedBehavior?: ExpectedBehavior[];
+}): ConversationTurn {
+	return {
+		turnId: args.turnId || `turn_${Math.random().toString(36).slice(2, 10)}`,
+		stepId: args.stepId,
+		role: args.role,
+		content: args.content,
+		expectedBehavior: args.expectedBehavior,
+	};
+}
+
+export function ensureConversationTurnIdentity(
+	history?: ConversationTurn[],
+): ConversationTurn[] {
+	return (history || []).map((turn, index) => ({
+		...turn,
+		turnId: turn.turnId || turn.stepId || `turn_${index + 1}`,
+		stepId: turn.stepId || turn.turnId || `step_${index + 1}`,
+	}));
+}
+
 /**
- * Returns the last user message from history, or falls back to item.question
+ * Returns the last user message from history, or item.question when history is not an array.
  */
 export function getLastUserTurn(item: GroundTruthItem): string {
-	if (!item.history || item.history.length === 0) {
-		return item.question;
+	if (!Array.isArray(item.history)) {
+		return item.question || "";
+	}
+	const history = ensureConversationTurnIdentity(item.history);
+	if (history.length === 0) {
+		return "";
 	}
 	// Find the last user turn
-	for (let i = item.history.length - 1; i >= 0; i--) {
-		if (item.history[i].role === "user") {
-			return item.history[i].content;
+	for (let i = history.length - 1; i >= 0; i--) {
+		if (history[i].role === "user") {
+			return history[i].content;
 		}
 	}
-	return item.question;
+	return "";
 }
 
 /**
- * Returns the last agent message from history, or falls back to item.answer
+ * Returns the last agent message from history, or item.answer when history is not an array.
+ * "Agent" is any turn whose role is not "user" (supports free-form roles).
  */
 export function getLastAgentTurn(item: GroundTruthItem): string {
-	if (!item.history || item.history.length === 0) {
-		return item.answer;
+	if (!Array.isArray(item.history)) {
+		return item.answer || "";
 	}
-	// Find the last agent turn
-	for (let i = item.history.length - 1; i >= 0; i--) {
-		if (item.history[i].role === "agent") {
-			return item.history[i].content;
+	const history = ensureConversationTurnIdentity(item.history);
+	if (history.length === 0) {
+		return "";
+	}
+	// Find the last non-user turn (any agent/assistant/orchestrator role)
+	for (let i = history.length - 1; i >= 0; i--) {
+		if (history[i].role !== "user") {
+			return history[i].content;
 		}
 	}
-	return item.answer;
+	return "";
 }
 
 /**
  * Returns the total number of turns in the conversation
  */
 export function getTurnCount(item: GroundTruthItem): number {
-	return item.history?.length || 0;
+	return ensureConversationTurnIdentity(item.history).length;
 }
 
 /**
  * Checks if the item is using multi-turn mode
  */
 export function isMultiTurn(item: GroundTruthItem): boolean {
-	return !!item.history && item.history.length > 0;
+	return ensureConversationTurnIdentity(item.history).length > 0;
+}
+
+/**
+ * Returns a short preview string for queue display: the first user turn from
+ * history, or item.question when history is not an array, else "(no message)".
+ */
+export function getQueuePreview(item: GroundTruthItem): string {
+	if (!Array.isArray(item.history)) {
+		return item.question || "(no message)";
+	}
+	const first = ensureConversationTurnIdentity(item.history).find(
+		(t) => t.role === "user",
+	);
+	return first?.content || "(no message)";
+}
+
+export function withDerivedLegacyFields(
+	item: GroundTruthItem,
+): GroundTruthItem {
+	const history = Array.isArray(item.history)
+		? ensureConversationTurnIdentity(item.history)
+		: item.history;
+	const derivedItem = {
+		...item,
+		history,
+	};
+	return {
+		...derivedItem,
+		question: getLastUserTurn(derivedItem),
+		answer: getLastAgentTurn(derivedItem),
+	};
+}
+
+/**
+ * Returns whether the item has any generic evidence data worth showing
+ * in the evidence/trace panel (contextEntries, toolCalls, expectedTools, traceIds, metadata, plugins, feedback, tracePayload).
+ */
+export function hasEvidenceData(item: GroundTruthItem): boolean {
+	return (
+		(item.contextEntries?.length ?? 0) > 0 ||
+		(item.toolCalls?.length ?? 0) > 0 ||
+		item.expectedTools != null ||
+		item.traceIds != null ||
+		Object.keys(item.metadata ?? {}).length > 0 ||
+		Object.keys(item.plugins ?? {}).length > 0 ||
+		(item.feedback?.length ?? 0) > 0 ||
+		Object.keys(item.tracePayload ?? {}).length > 0
+	);
 }
 
 /**
@@ -126,7 +514,12 @@ export function formatConversationForAgent(
 	const slice = turns.slice(0, end);
 	return slice
 		.map((turn) => {
-			const label = turn.role === "agent" ? "Agent" : "User";
+			const label =
+				turn.role === "user"
+					? "User"
+					: turn.role === "agent"
+						? "Agent"
+						: turn.role;
 			const body = (turn.content || "").trim();
 			return body ? `${label}: ${body}` : `${label}:`;
 		})
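A minimal sketch of how the turn-identity and reference-identity helpers in this file compose. It uses simplified stand-in types (`Turn`, `Ref`) instead of the project's `ConversationTurn` and `Reference`, the names `backfillTurnIds` and `identityKey` are illustrative, and the URLs are placeholders.

```typescript
// Simplified stand-ins for ConversationTurn and Reference (assumed shapes).
type Turn = { turnId?: string; stepId?: string; role: string; content: string };
type Ref = {
	url: string;
	toolCallId?: string;
	turnId?: string;
	messageIndex?: number;
	snippet?: string;
	keyParagraph?: string;
};

// Backfill ids as the diff does: turnId falls back to stepId, then to position.
function backfillTurnIds(history: Turn[] = []): Turn[] {
	return history.map((turn, index) => ({
		...turn,
		turnId: turn.turnId || turn.stepId || `turn_${index + 1}`,
		stepId: turn.stepId || turn.turnId || `step_${index + 1}`,
	}));
}

// Compose owner, turn, URL, and (when present) chunk text into one key.
function identityKey(ref: Ref): string {
	const owner = ref.toolCallId ? `tool:${ref.toolCallId}` : "tool:none";
	const turn = ref.turnId
		? `turn:${ref.turnId}`
		: ref.messageIndex !== undefined
			? `index:${ref.messageIndex}`
			: "index:none";
	const snippet = ref.snippet?.trim();
	const keyParagraph = ref.keyParagraph?.trim();
	const base = `${owner}::${turn}::${ref.url}`;
	if (!snippet && !keyParagraph) return base;
	return `${base}::chunk:${JSON.stringify([snippet ?? null, keyParagraph ?? null])}`;
}

const sameUrlTurn1 = identityKey({ url: "https://example.com/doc", turnId: "turn_1" });
const sameUrlTurn2 = identityKey({ url: "https://example.com/doc", turnId: "turn_2" });
console.log(sameUrlTurn1 === sameUrlTurn2); // false: same URL, different owning turn
```

A ref that carries a `turnId` keeps the same key even if its `messageIndex` changes, because the index is only consulted when `turnId` is absent.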
diff --git a/frontend/src/models/gtHelpers.ts b/frontend/src/models/gtHelpers.ts
index acfe93a..04b8c52 100644
--- a/frontend/src/models/gtHelpers.ts
+++ b/frontend/src/models/gtHelpers.ts
@@ -1,47 +1,35 @@
-import { getCachedConfig } from "../services/runtimeConfig";
 import type { GroundTruthItem, Reference } from "./groundTruth";
-import { refsApprovalReady, validateConversationPattern } from "./validators";
+import { getItemReferences, getReferenceIdentityKey } from "./groundTruth";
+import {
+	type ReferenceApprovalRequirements,
+	refsApprovalReady,
+	validateConversationPattern,
+	validateExpectedTools,
+} from "./validators";
 
-// Get config value for reference visit requirement (default: true)
-const requireReferenceVisit = () => {
-	const config = getCachedConfig();
-	if (config !== null) {
-		return config.requireReferenceVisit;
-	}
-	// Fallback to env var if config not loaded yet (shouldn't happen in normal flow)
-	const val = import.meta.env.VITE_REQUIRE_REFERENCE_VISIT;
-	if (val === undefined || val === null) return true;
-	if (typeof val === "boolean") return val;
-	return val !== "false" && val !== "0";
-};
-
-// Get config value for key paragraph requirement (default: false)
-const requireKeyParagraph = () => {
-	const config = getCachedConfig();
-	if (config !== null) {
-		return config.requireKeyParagraph;
-	}
-	// Fallback to env var if config not loaded yet (shouldn't happen in normal flow)
-	const val = import.meta.env.VITE_REQUIRE_KEY_PARAGRAPH;
-	if (val === undefined || val === null) return false;
-	if (typeof val === "boolean") return val;
-	return val === "true" || val === "1";
-};
+/**
+ * Check whether a plugin declares exemption from the required-tools check.
+ * A plugin payload with `data.canBypassRequiredTools: true` opts the item
+ * out of the ≥1 required tool gate.
+ */
+export function canBypassRequiredToolsCheck(item: GroundTruthItem): boolean {
+	if (!item.plugins) return false;
+	return Object.values(item.plugins).some(
+		(p) => p.data?.canBypassRequiredTools === true,
+	);
+}
 
-// Dedupe references by URL and messageIndex combination
-// In multi-turn contexts, the same URL can exist for different turns
-// In single-turn contexts (no messageIndex), dedupe by URL only
+// Dedupe references by identity key: toolCallId, then turnId (or the legacy
+// messageIndex when no stable turnId is present), then URL and snippet/keyParagraph.
 export function dedupeReferences(
 	existing: Reference[],
 	chosen: Reference[],
 ): Reference[] {
-	// Create a composite key: URL + messageIndex (or URL only if no messageIndex)
-	const makeKey = (r: Reference) =>
-		r.messageIndex !== undefined ? `${r.url}::turn${r.messageIndex}` : r.url;
-
-	const map = new Map(existing.map((r) => [makeKey(r), r] as const));
+	const map = new Map(
+		existing.map((r) => [getReferenceIdentityKey(r), r] as const),
+	);
 	for (const r of chosen) {
-		const key = makeKey(r);
+		const key = getReferenceIdentityKey(r);
 		if (!map.has(key)) {
 			map.set(key, r);
 		}
@@ -49,54 +37,47 @@ export function dedupeReferences(
 	return Array.from(map.values());
 }
 
-// Determine if an item can be approved (single-turn or multi-turn)
+// Determine if an item can be approved (generic or single-turn)
 export function canApproveCandidate(
 	item: GroundTruthItem | null | undefined,
+	approvalRequirements?: ReferenceApprovalRequirements,
 ): boolean {
 	if (!item) return false;
 	if (item.deleted) return false;
 
-	// Check if multi-turn
+	// Check if multi-turn or generic (has history)
 	if (item.history && item.history.length > 0) {
-		return canApproveMultiTurn(item);
+		return canApproveMultiTurn(item, approvalRequirements);
 	}
 
-	// Single-turn validation (existing logic)
-	const hasReferences =
-		Array.isArray(item.references) && item.references.length > 0;
-	return hasReferences && refsApprovalReady(item);
+	// Single-turn fallback (compatibility — kept for items without history)
+	const refs = getItemReferences(item);
+	const hasReferences = refs.length > 0;
+	return hasReferences && refsApprovalReady(item, approvalRequirements);
 }
 
-// Determine if a multi-turn item can be approved
+// Determine if a multi-turn / generic item can be approved.
+// Generic approval gate: valid conversation pattern + not deleted +
+// ≥1 required expected tool (unless plugin bypass) +
+// required expected tools present in toolCalls (when specified) + references approval-ready.
 export function canApproveMultiTurn(
 	item: GroundTruthItem | null | undefined,
+	approvalRequirements?: ReferenceApprovalRequirements,
 ): boolean {
 	if (!item || !item.history || item.history.length === 0) return false;
 	if (item.deleted) return false;
 
-	// Validate conversation pattern (user → agent alternating, complete pairs)
+	// Validate conversation pattern (starts with a user turn, ends with a non-user turn)
 	const patternValidation = validateConversationPattern(item.history);
 	if (!patternValidation.valid) return false;
 
-	// Check that all agent turns have at least one expected behavior (REQUIRED)
-	const allAgentTurnsHaveExpectedBehavior = item.history
-		.filter((turn) => turn.role === "agent")
-		.every((turn) => turn.expectedBehavior && turn.expectedBehavior.length > 0);
-	if (!allAgentTurnsHaveExpectedBehavior) return false;
-
-	// Check if all references must be visited (configurable)
-	if (requireReferenceVisit()) {
-		const allVisited = item.references.every((r) => Boolean(r.visitedAt));
-		if (!allVisited) return false;
-	}
+	// Require at least one required tool unless a plugin overrides this gate
+	const hasRequired = (item.expectedTools?.required?.length ?? 0) > 0;
+	if (!hasRequired && !canBypassRequiredToolsCheck(item)) return false;
 
-	// Check if key paragraphs are required (configurable)
-	if (requireKeyParagraph()) {
-		const allHaveKeyParagraph = item.references.every(
-			(r) => r.keyParagraph && r.keyParagraph.trim().length >= 40,
-		);
-		if (!allHaveKeyParagraph) return false;
-	}
+	// Validate expected tools when the item defines required tools
+	const toolValidation = validateExpectedTools(item);
+	if (!toolValidation.valid) return false;
 
-	return true;
+	return refsApprovalReady(item, approvalRequirements);
 }
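The order of checks in `canApproveMultiTurn` can be sketched as a single predicate. This is an illustration under assumed minimal types (`Item` is a stand-in, and `passesGenericGate` is a hypothetical name); it inlines the pattern and expected-tools checks and leaves out the final `refsApprovalReady` step.

```typescript
// Assumed minimal shapes; the real GroundTruthItem carries many more fields.
type ToolExpectation = { name: string };
type Item = {
	deleted?: boolean;
	history?: { role: string; content: string }[];
	expectedTools?: { required?: ToolExpectation[] };
	toolCalls?: { name: string }[];
	plugins?: Record<string, { data?: { canBypassRequiredTools?: boolean } }>;
};

function passesGenericGate(item: Item): boolean {
	const history = item.history ?? [];
	if (item.deleted || history.length === 0) return false;

	// Pattern: first turn is "user", last turn is any non-user role.
	if (history[0].role !== "user") return false;
	if (history[history.length - 1].role === "user") return false;

	// At least one required tool, unless some plugin payload opts out.
	const required = item.expectedTools?.required ?? [];
	const bypass = Object.values(item.plugins ?? {}).some(
		(p) => p.data?.canBypassRequiredTools === true,
	);
	if (required.length === 0 && !bypass) return false;

	// Every required tool must appear at least once in toolCalls.
	const called = new Set((item.toolCalls ?? []).map((tc) => tc.name));
	return required.every((te) => called.has(te.name));
}
```

Note that a plugin bypass only waives the "at least one required tool" rule; any required tools that are listed must still appear in `toolCalls`.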
diff --git a/frontend/src/models/validators.ts b/frontend/src/models/validators.ts
index b4fad68..0228e01 100644
--- a/frontend/src/models/validators.ts
+++ b/frontend/src/models/validators.ts
@@ -1,31 +1,86 @@
-import { getCachedConfig } from "../services/runtimeConfig";
+import {
+	getRuntimeConfigSnapshot,
+	type RuntimeConfig,
+} from "../services/runtimeConfig";
 import type { ConversationTurn, GroundTruthItem } from "./groundTruth";
+import { getItemReferences } from "./groundTruth";
 
-// Get config value for reference visit requirement (default: true)
-const requireReferenceVisit = () => {
-	const config = getCachedConfig();
-	if (config !== null) {
-		return config.requireReferenceVisit;
+// ---------------------------------------------------------------------------
+// Expected-tools validation
+// ---------------------------------------------------------------------------
+
+/**
+ * Result of validating an item's expectedTools against its actual toolCalls.
+ */
+export type ExpectedToolsValidationResult = {
+	/** True when all required tools are present in toolCalls (or no requirements). */
+	valid: boolean;
+	/** Names of required tools that were not found in toolCalls. */
+	missingRequired: string[];
+	/** Human-readable error messages, one per missing required tool. */
+	errors: string[];
+};
+
+/**
+ * Validates that every tool listed under `expectedTools.required` appears at
+ * least once in `toolCalls`.  Optional and notNeeded buckets are informational
+ * and do not affect the result.
+ *
+ * Returns `valid: true` when:
+ * - `expectedTools` is absent or has no required tools, OR
+ * - All required tools appear in `toolCalls`.
+ */
+export function validateExpectedTools(
+	item: GroundTruthItem,
+): ExpectedToolsValidationResult {
+	const required = item.expectedTools?.required;
+	if (!required?.length) {
+		return { valid: true, missingRequired: [], errors: [] };
 	}
-	// Fallback to env var if config not loaded yet (shouldn't happen in normal flow)
+
+	const calledNames = new Set((item.toolCalls ?? []).map((tc) => tc.name));
+	const missingRequired = required
+		.filter((te) => !calledNames.has(te.name))
+		.map((te) => te.name);
+
+	return {
+		valid: missingRequired.length === 0,
+		missingRequired,
+		errors: missingRequired.map(
+			(name) => `Required tool "${name}" was not called`,
+		),
+	};
+}
+
+export type ReferenceApprovalRequirements = Pick<
+	RuntimeConfig,
+	"requireReferenceVisit" | "requireKeyParagraph"
+>;
+
+function getFallbackRequireReferenceVisit() {
 	const val = import.meta.env.VITE_REQUIRE_REFERENCE_VISIT;
 	if (val === undefined || val === null) return true;
 	if (typeof val === "boolean") return val;
 	return val !== "false" && val !== "0";
-};
+}
 
-// Get config value for key paragraph requirement (default: false)
-const requireKeyParagraph = () => {
-	const config = getCachedConfig();
-	if (config !== null) {
-		return config.requireKeyParagraph;
-	}
-	// Fallback to env var if config not loaded yet (shouldn't happen in normal flow)
+function getFallbackRequireKeyParagraph() {
 	const val = import.meta.env.VITE_REQUIRE_KEY_PARAGRAPH;
 	if (val === undefined || val === null) return false;
 	if (typeof val === "boolean") return val;
 	return val === "true" || val === "1";
-};
+}
+
+export function getReferenceApprovalRequirements(
+	config: RuntimeConfig | null = getRuntimeConfigSnapshot(),
+): ReferenceApprovalRequirements {
+	return {
+		requireReferenceVisit:
+			config?.requireReferenceVisit ?? getFallbackRequireReferenceVisit(),
+		requireKeyParagraph:
+			config?.requireKeyParagraph ?? getFallbackRequireKeyParagraph(),
+	};
+}
 
 /**
  * Validation result for multi-turn conversations.
@@ -37,11 +92,14 @@ type ConversationValidationResult = {
 };
 
 /**
- * Validates that a conversation follows the required pattern:
- * - Must start with a user turn
- * - Must alternate between user and agent turns
- * - Every user turn must have a corresponding agent turn
- * - The conversation should end with an agent turn for approval
+ * Validates that a conversation meets minimum structural requirements:
+ * - Must have at least one turn
+ * - Must start with a user turn (role === "user")
+ * - Must end with a non-user (agent) turn
+ *
+ * Consecutive turns of the same role (e.g. multiple agent responses) are
+ * allowed to support agentic workflows such as orchestrator → sub-agent or
+ * separate chat_response and RCA turns.
  *
  * @param history - The conversation history to validate
  * @returns Validation result with any errors found
@@ -61,24 +119,9 @@ export function validateConversationPattern(
 		errors.push("Conversation must start with a user turn");
 	}
 
-	// Check alternating pattern and that every user has an agent response
-	for (let i = 0; i < history.length; i++) {
-		const currentTurn = history[i];
-		const expectedRole = i % 2 === 0 ? "user" : "agent";
-
-		if (currentTurn.role !== expectedRole) {
-			errors.push(
-				`Turn ${i + 1} should be a ${expectedRole} turn, but found ${currentTurn.role} turn`,
-			);
-		}
-	}
-
-	// For approval, the conversation should end with an agent turn (even index count)
-	// This ensures every user query has an agent response
-	if (history.length % 2 !== 0) {
-		errors.push(
-			"Conversation must end with an agent response (every user turn needs an agent response)",
-		);
+	// Must end with an agent (non-user) turn so the last user query has a response
+	if (history[history.length - 1].role === "user") {
+		errors.push("Conversation must end with an agent response");
 	}
 
 	return {
@@ -88,19 +131,23 @@ export function validateConversationPattern(
 }
 
 // Validation helper (SELF-TESTED)
-export function refsApprovalReady(it: GroundTruthItem): boolean {
+export function refsApprovalReady(
+	it: GroundTruthItem,
+	requirements = getReferenceApprovalRequirements(),
+): boolean {
+	const refs = getItemReferences(it);
 	// Rule: Approval is allowed with zero references.
-	if (!it.references || it.references.length === 0) return true;
+	if (refs.length === 0) return true;
 
 	// Check if all references must be visited (configurable)
-	if (requireReferenceVisit()) {
-		const allVisited = it.references.every((r) => Boolean(r.visitedAt));
+	if (requirements.requireReferenceVisit) {
+		const allVisited = refs.every((r) => Boolean(r.visitedAt));
 		if (!allVisited) return false;
 	}
 
 	// Check if key paragraphs are required for all references (configurable)
-	if (requireKeyParagraph()) {
-		const allHaveKeyParagraph = it.references.every(
+	if (requirements.requireKeyParagraph) {
+		const allHaveKeyParagraph = refs.every(
 			(r) => r.keyParagraph && r.keyParagraph.trim().length >= 40,
 		);
 		if (!allHaveKeyParagraph) return false;
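Passing the requirements in as an argument, rather than reading cached config inside the check, makes the reference gate easy to exercise in isolation. A rough sketch under assumed types (`Ref`, `Requirements`, and `refsReady` are stand-in names); the tail of `refsApprovalReady` lies outside this hunk, so the final `return true` is an assumption.

```typescript
// Stand-in shapes for Reference and ReferenceApprovalRequirements.
type Ref = { url: string; visitedAt?: string | null; keyParagraph?: string };
type Requirements = {
	requireReferenceVisit: boolean;
	requireKeyParagraph: boolean;
};

function refsReady(refs: Ref[], requirements: Requirements): boolean {
	// Zero references never block approval.
	if (refs.length === 0) return true;

	// Optionally require every reference to have been visited.
	if (requirements.requireReferenceVisit) {
		if (!refs.every((r) => Boolean(r.visitedAt))) return false;
	}

	// Optionally require a key paragraph of at least 40 trimmed characters.
	if (requirements.requireKeyParagraph) {
		if (!refs.every((r) => (r.keyParagraph ?? "").trim().length >= 40)) {
			return false;
		}
	}

	return true; // assumed: the original function's ending is not shown in the diff
}
```

Callers that want the runtime defaults can still build the requirements object once and pass it down, which is what the optional `approvalRequirements` parameter in `gtHelpers.ts` appears to allow.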
diff --git a/frontend/src/registry/ExplorerExtensions.ts b/frontend/src/registry/ExplorerExtensions.ts
new file mode 100644
index 0000000..b98b7ab
--- /dev/null
+++ b/frontend/src/registry/ExplorerExtensions.ts
@@ -0,0 +1,120 @@
+// Explorer extension types for plugin-contributed columns and filters.
+//
+// Plugin packs register ExplorerExtension objects that QuestionsExplorer
+// reads at render time to add dynamic columns and filter dimensions.
+
+import type { ComponentType } from "react";
+import type { GroundTruthItem } from "../models/groundTruth";
+
+// ---------------------------------------------------------------------------
+// Column extension
+// ---------------------------------------------------------------------------
+
+/** Props passed to custom cell renderers registered by plugins. */
+export type ExplorerCellProps = {
+	item: GroundTruthItem;
+};
+
+/** A single column contributed by a plugin pack. */
+export type ExplorerColumnExtension = {
+	/** Unique key used for sorting and identification. */
+	key: string;
+	/** Header label displayed in the explorer table. */
+	header: string;
+	/** Custom cell renderer. When absent, `getValue` is rendered as text. */
+	cellRenderer?: ComponentType<ExplorerCellProps>;
+	/** Extract a sortable/displayable value from an item (used when no cellRenderer). */
+	getValue: (item: GroundTruthItem) => string | number | null;
+	/** Column width hint (CSS value, e.g. "80px" or "6rem"). */
+	width?: string;
+	/** Whether this column supports client-side sorting. Default true. */
+	sortable?: boolean;
+};
+
+// ---------------------------------------------------------------------------
+// Filter extension
+// ---------------------------------------------------------------------------
+
+/** A single filter dimension contributed by a plugin pack. */
+export type ExplorerFilterExtension = {
+	/** Unique key for URL-sync and state management. */
+	key: string;
+	/** Human-readable label. */
+	label: string;
+	/** Compute the set of distinct filter options from the current items. */
+	getOptions: (items: GroundTruthItem[]) => string[];
+	/** Return true when an item matches the selected filter value. */
+	matches: (item: GroundTruthItem, selectedValue: string) => boolean;
+};
+
+// ---------------------------------------------------------------------------
+// Aggregate extension registration
+// ---------------------------------------------------------------------------
+
+/** A complete explorer extension bundle registered by a plugin pack. */
+export type ExplorerExtension = {
+	/** Plugin pack name that owns this extension. */
+	packName: string;
+	/** Additional columns to render in the explorer table. */
+	columns?: ExplorerColumnExtension[];
+	/** Additional filter dimensions for the filter bar. */
+	filters?: ExplorerFilterExtension[];
+};
+
+// ---------------------------------------------------------------------------
+// Registry (module-level singleton)
+// ---------------------------------------------------------------------------
+
+const _extensions: ExplorerExtension[] = [];
+
+/** Register an explorer extension (typically called at app startup). */
+export function registerExplorerExtension(ext: ExplorerExtension): void {
+	const existing = _extensions.find((e) => e.packName === ext.packName);
+	if (existing) {
+		// Replace in-place to support hot-reload during development.
+		const idx = _extensions.indexOf(existing);
+		_extensions[idx] = ext;
+		return;
+	}
+	_extensions.push(ext);
+}
+
+/** Return all registered explorer extensions. */
+export function getExplorerExtensions(): ReadonlyArray<ExplorerExtension> {
+	return _extensions;
+}
+
+/** Clear all registrations (for testing). */
+export function resetExplorerExtensions(): void {
+	_extensions.length = 0;
+}
+
+// ---------------------------------------------------------------------------
+// Built-in RAG compat extension (reference count column)
+// ---------------------------------------------------------------------------
+
+import { getItemReferences } from "../models/groundTruth";
+
+registerExplorerExtension({
+	packName: "rag-compat",
+	columns: [
+		{
+			key: "referenceCount",
+			header: "Refs",
+			width: "60px",
+			sortable: true,
+			getValue: (item) => getItemReferences(item).length,
+		},
+	],
+	filters: [
+		{
+			key: "hasReferences",
+			label: "Has References",
+			getOptions: () => ["yes", "no"],
+			matches: (item, value) => {
+				const hasRefs = getItemReferences(item).length > 0;
+				return value === "yes" ? hasRefs : !hasRefs;
+			},
+		},
+	],
+});
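For illustration, a plugin pack other than the built-in `rag-compat` one might register a column and a filter along these lines. The sketch re-declares a tiny registry so it runs on its own; the pack name `tool-stats`, the `toolCallCount` column, and the `usedTool` filter are hypothetical, and `Item` is a reduced stand-in for `GroundTruthItem`.

```typescript
// Assumed minimal item shape; the real GroundTruthItem has many more fields.
type Item = { id: string; toolCalls?: { name: string }[] };
type ColumnExt = {
	key: string;
	header: string;
	getValue: (item: Item) => string | number | null;
	width?: string;
	sortable?: boolean;
};
type FilterExt = {
	key: string;
	label: string;
	getOptions: (items: Item[]) => string[];
	matches: (item: Item, selectedValue: string) => boolean;
};
type Extension = { packName: string; columns?: ColumnExt[]; filters?: FilterExt[] };

// Local stand-in for the module-level registry; replaces by packName.
const registry: Extension[] = [];
function register(ext: Extension): void {
	const idx = registry.findIndex((e) => e.packName === ext.packName);
	if (idx >= 0) {
		registry[idx] = ext;
	} else {
		registry.push(ext);
	}
}

register({
	packName: "tool-stats", // hypothetical pack name
	columns: [
		{
			key: "toolCallCount",
			header: "Tools",
			width: "60px",
			getValue: (item) => item.toolCalls?.length ?? 0,
		},
	],
	filters: [
		{
			key: "usedTool",
			label: "Used Tool",
			// Distinct tool names across the current items, sorted.
			getOptions: (items) => {
				const names = new Set<string>();
				for (const item of items) {
					for (const tc of item.toolCalls ?? []) names.add(tc.name);
				}
				return Array.from(names).sort();
			},
			matches: (item, selectedValue) =>
				(item.toolCalls ?? []).some((tc) => tc.name === selectedValue),
		},
	],
});

const sampleItems: Item[] = [
	{ id: "a", toolCalls: [{ name: "search" }, { name: "lookup" }] },
	{ id: "b" },
];
const usedTool = registry[0].filters?.[0];
console.log(usedTool?.getOptions(sampleItems)); // distinct tool names, sorted
```

Registering the same `packName` again replaces the earlier bundle rather than adding a second one, which matches the hot-reload comment in `registerExplorerExtension`.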
diff --git a/frontend/src/registry/FieldComponentRegistry.ts b/frontend/src/registry/FieldComponentRegistry.ts
new file mode 100644
index 0000000..53d07b4
--- /dev/null
+++ b/frontend/src/registry/FieldComponentRegistry.ts
@@ -0,0 +1,126 @@
+import type { ComponentType } from "react";
+import type { ToolCallRecord } from "../models/groundTruth";
+import type {
+	ComponentRegistration,
+	EditorProps,
+	FieldComponentRegistryAPI,
+	ToolCallExtensionRegistration,
+	ToolCallExtensionRegistryAPI,
+	ViewerProps,
+} from "./types";
+
+export class FieldComponentRegistry implements FieldComponentRegistryAPI {
+	private readonly store = new Map<string, ComponentRegistration>();
+
+	register(registration: ComponentRegistration): void {
+		if (import.meta.env.DEV && this.store.has(registration.discriminator)) {
+			console.warn(
+				`[FieldComponentRegistry] Duplicate registration for discriminator: ${registration.discriminator}`,
+			);
+		}
+		this.store.set(registration.discriminator, registration);
+	}
+
+	registerIfAbsent(registration: ComponentRegistration): void {
+		if (this.has(registration.discriminator)) {
+			return;
+		}
+		this.store.set(registration.discriminator, registration);
+	}
+
+	resolve(
+		discriminator: string,
+		mode: "viewer" | "editor",
+	): ComponentType<ViewerProps> | ComponentType<EditorProps> | undefined {
+		const exact = this.store.get(discriminator);
+		if (exact) {
+			return mode === "editor" ? (exact.editor ?? exact.viewer) : exact.viewer;
+		}
+
+		for (const [key, reg] of this.store) {
+			if (
+				discriminator.startsWith(key) &&
+				discriminator.charAt(key.length) === ":"
+			) {
+				return mode === "editor" ? (reg.editor ?? reg.viewer) : reg.viewer;
+			}
+		}
+
+		return undefined;
+	}
+
+	has(discriminator: string): boolean {
+		if (this.store.has(discriminator)) {
+			return true;
+		}
+		for (const key of this.store.keys()) {
+			if (
+				discriminator.startsWith(key) &&
+				discriminator.charAt(key.length) === ":"
+			) {
+				return true;
+			}
+		}
+		return false;
+	}
+
+	registrations(): ReadonlyArray<ComponentRegistration> {
+		return Array.from(this.store.values());
+	}
+
+	reset(): void {
+		this.store.clear();
+	}
+}
+
+export function toolCallDiscriminator(tc: ToolCallRecord): string {
+	return `toolCall:${tc.name}`;
+}
+
+export class ToolCallExtensions implements ToolCallExtensionRegistryAPI {
+	private readonly store = new Map<string, ToolCallExtensionRegistration>();
+
+	register(registration: ToolCallExtensionRegistration): void {
+		if (import.meta.env.DEV && this.store.has(registration.discriminator)) {
+			console.warn(
+				`[ToolCallExtensions] Replacing registration for discriminator: ${registration.discriminator}`,
+			);
+		}
+		this.store.set(registration.discriminator, registration);
+	}
+
+	resolveAll(
+		toolCall: ToolCallRecord,
+	): ReadonlyArray<ToolCallExtensionRegistration> {
+		const disc = toolCallDiscriminator(toolCall);
+		const matches: ToolCallExtensionRegistration[] = [];
+
+		for (const [key, reg] of this.store) {
+			const discriminatorMatch =
+				key === disc ||
+				(disc.startsWith(key) && disc.charAt(key.length) === ":");
+
+			if (!discriminatorMatch) continue;
+			if (reg.matches && !reg.matches(toolCall)) continue;
+
+			matches.push(reg);
+		}
+
+		return matches;
+	}
+
+	hasMatch(toolCall: ToolCallRecord): boolean {
+		return this.resolveAll(toolCall).length > 0;
+	}
+
+	registrations(): ReadonlyArray<ToolCallExtensionRegistration> {
+		return Array.from(this.store.values());
+	}
+
+	reset(): void {
+		this.store.clear();
+	}
+}
+
+export const fieldComponentRegistry = new FieldComponentRegistry();
+export const toolCallExtensions = new ToolCallExtensions();
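The prefix fallback in `FieldComponentRegistry.resolve` walks the `Map` in insertion order and returns on the first key that is a `:`-delimited prefix, so the earliest registered prefix wins rather than the longest one. A small sketch of that behaviour, with plain strings standing in for React components and `resolveViewer` as an illustrative name:

```typescript
// Strings stand in for the viewer components held by real registrations.
type Registration = { discriminator: string; viewer: string };

function resolveViewer(
	store: Map<string, Registration>,
	discriminator: string,
): string | undefined {
	// Exact match takes priority.
	const exact = store.get(discriminator);
	if (exact) return exact.viewer;

	// Otherwise return the first registered key that is a ":"-delimited prefix.
	for (const [key, reg] of store) {
		if (
			discriminator.startsWith(key) &&
			discriminator.charAt(key.length) === ":"
		) {
			return reg.viewer;
		}
	}
	return undefined;
}

const store = new Map<string, Registration>();
store.set("toolCall", { discriminator: "toolCall", viewer: "GenericToolCall" });
store.set("toolCall:search", {
	discriminator: "toolCall:search",
	viewer: "SearchToolCall",
});

console.log(resolveViewer(store, "toolCall:search")); // "SearchToolCall" (exact match)
console.log(resolveViewer(store, "toolCall:search:v2")); // "GenericToolCall" (first registered prefix)
console.log(resolveViewer(store, "toolCallX")); // undefined (no ":" boundary after the prefix)
```

If most-specific-wins resolution is wanted, registration order matters in this sketch; the diff does not say whether callers depend on that ordering.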
diff --git a/frontend/src/registry/PluginErrorBoundary.tsx b/frontend/src/registry/PluginErrorBoundary.tsx
new file mode 100644
index 0000000..f56c3ae
--- /dev/null
+++ b/frontend/src/registry/PluginErrorBoundary.tsx
@@ -0,0 +1,41 @@
+import type { ErrorInfo, ReactNode } from "react";
+import { Component } from "react";
+
+type Props = {
+	/** Rendered when the child tree throws. */
+	fallback: ReactNode;
+	children: ReactNode;
+};
+
+type State = {
+	hasError: boolean;
+};
+
+/**
+ * Error boundary that catches render-time errors from plugin-contributed
+ * components and swaps in a fallback renderer so the rest of the UI stays
+ * intact.
+ */
+export class PluginErrorBoundary extends Component<Props, State> {
+	constructor(props: Props) {
+		super(props);
+		this.state = { hasError: false };
+	}
+
+	static getDerivedStateFromError(): State {
+		return { hasError: true };
+	}
+
+	override componentDidCatch(error: Error, info: ErrorInfo): void {
+		if (import.meta.env.DEV) {
+			console.error("[PluginErrorBoundary] Caught error:", error, info);
+		}
+	}
+
+	override render(): ReactNode {
+		if (this.state.hasError) {
+			return this.props.fallback;
+		}
+		return this.props.children;
+	}
+}
diff --git a/frontend/src/registry/RegistryRenderer.tsx b/frontend/src/registry/RegistryRenderer.tsx
new file mode 100644
index 0000000..4172754
--- /dev/null
+++ b/frontend/src/registry/RegistryRenderer.tsx
@@ -0,0 +1,91 @@
+import {
+	fieldComponentRegistry,
+	toolCallExtensions,
+} from "./FieldComponentRegistry";
+import { PluginErrorBoundary } from "./PluginErrorBoundary";
+import type { RenderContext, ToolCallActionProps, ViewerProps } from "./types";
+
+type RegistryRendererProps = {
+	discriminator: string;
+	data: unknown;
+	context: RenderContext;
+	mode: "viewer" | "editor";
+	onChange?: (data: unknown) => void;
+	onValidate?: (data: unknown) => string[];
+};
+
+function FallbackFor({ data, context }: ViewerProps) {
+	if (typeof data === "string") {
+		return (
+			<pre className="max-h-64 overflow-auto rounded-md bg-slate-100 p-2 text-xs text-slate-700">
+				{data}
+			</pre>
+		);
+	}
+	if (data !== null && typeof data === "object" && !Array.isArray(data)) {
+		return (
+			<div className="grid gap-2 text-xs text-slate-700">
+				{Object.entries(data as Record<string, unknown>).map(([key, value]) => (
+					<div key={key} className="rounded-md bg-slate-50 p-2">
+						<div className="font-medium text-slate-500">{key}</div>
+						<div className="mt-1 break-all">
+							{typeof value === "string" ? value : JSON.stringify(value)}
+						</div>
+					</div>
+				))}
+			</div>
+		);
+	}
+	return (
+		<pre className="max-h-64 overflow-auto rounded-md bg-slate-100 p-2 text-xs text-slate-700">
+			{JSON.stringify({ data, fieldPath: context.fieldPath }, null, 2)}
+		</pre>
+	);
+}
+
+export function RegistryRenderer({
+	discriminator,
+	data,
+	context,
+	mode,
+	onChange,
+	onValidate,
+}: RegistryRendererProps) {
+	const Resolved = fieldComponentRegistry.resolve(discriminator, mode);
+	const fallback = <FallbackFor data={data} context={context} />;
+
+	if (!Resolved) {
+		return fallback;
+	}
+
+	const props =
+		mode === "editor" && onChange
+			? { data, context, onChange, onValidate }
+			: { data, context };
+
+	return (
+		<PluginErrorBoundary fallback={fallback}>
+			{/* biome-ignore lint/suspicious/noExplicitAny: resolved component is typed at registration time */}
+			<Resolved {...(props as any)} />
+		</PluginErrorBoundary>
+	);
+}
+
+export function ToolCallExtensionRenderer(props: ToolCallActionProps) {
+	const matches = toolCallExtensions.resolveAll(props.toolCall);
+
+	if (matches.length === 0) return null;
+
+	return (
+		<>
+			{matches.map((reg) => {
+				const Comp = reg.component;
+				return (
+					<PluginErrorBoundary key={reg.discriminator} fallback={null}>
+						<Comp {...props} />
+					</PluginErrorBoundary>
+				);
+			})}
+		</>
+	);
+}
diff --git a/frontend/src/registry/index.ts b/frontend/src/registry/index.ts
new file mode 100644
index 0000000..f3e17b0
--- /dev/null
+++ b/frontend/src/registry/index.ts
@@ -0,0 +1,29 @@
+// Public API for the plugin component registry.
+
+// Side-effect imports: self-registering extensions
+import "./ragCompatToolCallExtension";
+
+export type {
+	ExplorerCellProps,
+	ExplorerColumnExtension,
+	ExplorerExtension,
+	ExplorerFilterExtension,
+} from "./ExplorerExtensions";
+export {
+	getExplorerExtensions,
+	registerExplorerExtension,
+	resetExplorerExtensions,
+} from "./ExplorerExtensions";
+export {
+	ToolCallExtensions,
+	toolCallDiscriminator,
+	toolCallExtensions,
+} from "./FieldComponentRegistry";
+export { PluginErrorBoundary } from "./PluginErrorBoundary";
+export { ToolCallExtensionRenderer } from "./RegistryRenderer";
+export type {
+	ToolCallActionContext,
+	ToolCallActionProps,
+	ToolCallExtensionRegistration,
+	ToolCallExtensionRegistryAPI,
+} from "./types";
diff --git a/frontend/src/registry/ragCompatToolCallExtension.ts b/frontend/src/registry/ragCompatToolCallExtension.ts
new file mode 100644
index 0000000..b992503
--- /dev/null
+++ b/frontend/src/registry/ragCompatToolCallExtension.ts
@@ -0,0 +1,41 @@
+// RAG-compat tool call extension registration.
+//
+// Registers a "references" action on retrieval-type tool calls so curators
+// can view and manage per-call references inline in the tool call card.
+//
+// This module self-registers on import (side-effect), following the same
+// pattern as ExplorerExtensions.ts.
+
+import ToolCallReferencesAction from "../components/app/editors/ToolCallReferencesAction";
+import { toolCallExtensions } from "./FieldComponentRegistry";
+
+/** Tool names that indicate a retrieval / search call. */
+const RETRIEVAL_TOOL_NAMES = new Set([
+	"search",
+	"retrieval",
+	"lookup",
+	"fetch",
+	"query",
+	"find",
+	"get_documents",
+	"search_documents",
+	"vector_search",
+]);
+
+function isRetrievalToolCall(name: string): boolean {
+	const lower = name.toLowerCase();
+	// Fast path: exact match against a known retrieval tool name
+	if (RETRIEVAL_TOOL_NAMES.has(lower)) return true;
+	// Substring match for compound names like "azure_search", "doc_retrieval" (broad: "find" also matches "findings")
+	for (const keyword of RETRIEVAL_TOOL_NAMES) {
+		if (lower.includes(keyword)) return true;
+	}
+	return false;
+}
+
+toolCallExtensions.register({
+	discriminator: "toolCall",
+	component: ToolCallReferencesAction,
+	displayName: "RAG References",
+	matches: (tc) => isRetrievalToolCall(tc.name),
+});
diff --git a/frontend/src/registry/types.ts b/frontend/src/registry/types.ts
new file mode 100644
index 0000000..baa4548
--- /dev/null
+++ b/frontend/src/registry/types.ts
@@ -0,0 +1,80 @@
+import type { ComponentType } from "react";
+import type {
+	GroundTruthItem,
+	Reference,
+	ToolCallRecord,
+} from "../models/groundTruth";
+
+// ---------------------------------------------------------------------------
+// Field component registry
+// ---------------------------------------------------------------------------
+
+export type RenderContext = {
+	itemId: string;
+	fieldPath: string;
+	pluginKind?: string;
+	readOnly: boolean;
+};
+
+export type ViewerProps = {
+	data: unknown;
+	context: RenderContext;
+};
+
+export type EditorProps = ViewerProps & {
+	onChange: (data: unknown) => void;
+	onValidate?: (data: unknown) => string[];
+};
+
+export type ComponentRegistration = {
+	discriminator: string;
+	viewer: ComponentType<ViewerProps>;
+	editor?: ComponentType<EditorProps>;
+	displayName: string;
+};
+
+export type FieldComponentRegistryAPI = {
+	register(registration: ComponentRegistration): void;
+	registerIfAbsent(registration: ComponentRegistration): void;
+	resolve(
+		discriminator: string,
+		mode: "viewer" | "editor",
+	): ComponentType<ViewerProps> | ComponentType<EditorProps> | undefined;
+	registrations(): ReadonlyArray<ComponentRegistration>;
+	has(discriminator: string): boolean;
+};
+
+// ---------------------------------------------------------------------------
+// Tool call extension registry
+// ---------------------------------------------------------------------------
+
+export type ToolCallActionContext = {
+	item: GroundTruthItem;
+	readOnly: boolean;
+};
+
+export type ToolCallActionProps = {
+	toolCall: ToolCallRecord;
+	context: ToolCallActionContext;
+	references: Reference[];
+	onAddReferences?: (refs: Reference[]) => void;
+	onOpenReference?: (ref: Reference) => void;
+	onUpdateReference?: (refId: string, partial: Partial<Reference>) => void;
+	onRemoveReference?: (refId: string) => void;
+};
+
+export type ToolCallExtensionRegistration = {
+	discriminator: string;
+	component: ComponentType<ToolCallActionProps>;
+	displayName: string;
+	matches?: (toolCall: ToolCallRecord) => boolean;
+};
+
+export type ToolCallExtensionRegistryAPI = {
+	register(registration: ToolCallExtensionRegistration): void;
+	resolveAll(
+		toolCall: ToolCallRecord,
+	): ReadonlyArray<ToolCallExtensionRegistration>;
+	registrations(): ReadonlyArray<ToolCallExtensionRegistration>;
+	hasMatch(toolCall: ToolCallRecord): boolean;
+};
diff --git a/frontend/src/services/assignments.ts b/frontend/src/services/assignments.ts
index 670823a..1dc2daa 100644
--- a/frontend/src/services/assignments.ts
+++ b/frontend/src/services/assignments.ts
@@ -1,7 +1,7 @@
+import type { ApiGroundTruth as GroundTruthItemOut } from "../adapters/apiMapper";
 import { client } from "../api/client";
 import type { components } from "../api/generated";
 
-type GroundTruthItemOut = components["schemas"]["GroundTruthItem-Output"];
 type SelfServeResponse = components["schemas"]["SelfServeResponse"];
 
 // Request new assignments (self-serve). Returns payload with assigned items and counts.
diff --git a/frontend/src/services/chatService.ts b/frontend/src/services/chatService.ts
deleted file mode 100644
index 4d3aaa2..0000000
--- a/frontend/src/services/chatService.ts
+++ /dev/null
@@ -1,82 +0,0 @@
-import { client } from "../api/client";
-import type { components } from "../api/generated";
-import type { ConversationTurn, ExpectedBehavior } from "../models/groundTruth";
-import { formatConversationForAgent as formatTurns } from "../models/groundTruth";
-
-export type ChatReference = components["schemas"]["ChatReference"];
-
-/**
- * Maps expected behavior identifiers to descriptive instructions for the agent.
- * These provide rich context to help the backend generate appropriate responses.
- */
-const BEHAVIOR_DESCRIPTIONS: Record<ExpectedBehavior, string> = {
-	"tool:search":
-		"Perform a search or retrieval operation to find relevant information",
-	"generation:answer":
-		"Generate a direct, comprehensive answer to the user's question",
-	"generation:need-context":
-		"Indicate that more context or background information is needed to properly answer the question",
-	"generation:clarification":
-		"Ask for clarification about ambiguous or unclear aspects of the user's question",
-	"generation:out-of-domain":
-		"Politely indicate that the question is outside the scope of what you can help with",
-};
-
-/**
- * Formats expected behavior array into descriptive instructions for inclusion in chat requests.
- * Returns empty string if array is empty or undefined.
- * Example output:
- * "Expected Behavior: Perform a search or retrieval operation to find relevant information; Generate a direct, comprehensive answer to the user's question"
- */
-export function formatExpectedBehaviorForChat(
-	behaviors: ExpectedBehavior[] | undefined,
-): string {
-	if (!behaviors || behaviors.length === 0) return "";
-
-	const descriptions = behaviors
-		.map((behavior) => BEHAVIOR_DESCRIPTIONS[behavior])
-		.filter(Boolean); // Filter out any undefined descriptions
-
-	if (descriptions.length === 0) return "";
-
-	return `Expected Behavior: ${descriptions.join("; ")}`;
-}
-
-export function formatConversationForAgent(
-	turns: ConversationTurn[] | undefined,
-	upToIndex?: number,
-): string {
-	return formatTurns(turns, upToIndex);
-}
-
-export async function callAgentChat(
-	message: string,
-): Promise<{ content: string; references: ChatReference[] }> {
-	const trimmed = message.trim();
-	if (!trimmed) {
-		throw new Error("Agent chat message is required.");
-	}
-
-	// Create AbortController for timeout protection
-	const controller = new AbortController();
-	const timeoutId = setTimeout(() => controller.abort(), 120000); // 120s timeout
-
-	try {
-		const body: components["schemas"]["ChatRequest"] = {
-			message: trimmed,
-		};
-		const { data, error } = await client.POST("/v1/chat", {
-			body,
-			signal: controller.signal,
-		});
-
-		if (error) throw error;
-		const payload = data as components["schemas"]["ChatResponse"];
-		return {
-			content: payload.content,
-			references: payload.references ?? [],
-		};
-	} finally {
-		clearTimeout(timeoutId);
-	}
-}
diff --git a/frontend/src/services/datasets.ts b/frontend/src/services/datasets.ts
index a58ab6a..46f40fe 100644
--- a/frontend/src/services/datasets.ts
+++ b/frontend/src/services/datasets.ts
@@ -17,11 +17,12 @@ let datasetsCache: { data: string[] | null; timestamp: number } = {
  */
 export async function getDatasetCurationInstructions(
 	datasetName: string,
+	signal?: AbortSignal,
 ): Promise<DatasetCurationInstructions | undefined> {
 	if (!datasetName) return undefined;
 	const { data, error } = await client.GET(
 		"/v1/datasets/{datasetName}/curation-instructions",
-		{ params: { path: { datasetName } } },
+		{ params: { path: { datasetName } }, signal },
 	);
 	if (error) throw error;
 	return data as unknown as DatasetCurationInstructions;
@@ -33,6 +34,7 @@ export async function getDatasetCurationInstructions(
  */
 export async function fetchAvailableDatasets(
 	forceRefresh = false,
+	signal?: AbortSignal,
 ): Promise<string[]> {
 	const now = Date.now();
 
@@ -46,7 +48,7 @@ export async function fetchAvailableDatasets(
 	}
 
 	try {
-		const { data, error } = await client.GET("/v1/datasets", {});
+		const { data, error } = await client.GET("/v1/datasets", { signal });
 		if (error) throw error;
 		const raw = Array.isArray(data) ? data : [];
 		const names = new Set<string>();
@@ -64,7 +66,10 @@ export async function fetchAvailableDatasets(
 		};
 
 		return datasets;
-	} catch {
+	} catch (error) {
+		if (signal?.aborted) {
+			throw error;
+		}
 		// On error, return cached data if available, otherwise empty array
 		return datasetsCache.data ?? [];
 	}
diff --git a/frontend/src/services/groundTruths.ts b/frontend/src/services/groundTruths.ts
index 5dfd6de..f99a443 100644
--- a/frontend/src/services/groundTruths.ts
+++ b/frontend/src/services/groundTruths.ts
@@ -1,136 +1,43 @@
+import type {
+	ApiGroundTruth,
+	ApiHistoryEntry,
+	ApiReference,
+} from "../adapters/apiMapper";
+import { groundTruthFromApi } from "../adapters/apiMapper";
 import { client } from "../api/client";
 import type { components, operations } from "../api/generated";
 import type { GroundTruthItem } from "../models/groundTruth";
-import { urlToTitle } from "../models/utils";
 import { getApiBaseUrl, withDevUser } from "./http";
 import { logEvent } from "./telemetry";
 
-type GroundTruthItemOut = components["schemas"]["GroundTruthItem-Output"];
+type GroundTruthItemOut = Omit<
+	components["schemas"]["AgenticGroundTruthEntry-Output"],
+	"history"
+> & {
+	synthQuestion?: string | null;
+	editedQuestion?: string | null;
+	answer?: string | null;
+	refs?: ApiReference[];
+	totalReferences?: number;
+	tags?: string[];
+	comment?: string | null;
+	history?: ApiHistoryEntry[];
+};
 
 export type GroundTruthListPagination =
 	components["schemas"]["PaginationMetadata"];
 
+/**
+ * Maps an API ground truth payload to a domain GroundTruthItem.
+ * Delegates to the canonical groundTruthFromApi adapter to ensure
+ * both the provider path and the explorer/service path produce
+ * identical GroundTruthItem output for the same payload.
+ */
 export function mapGroundTruthFromApi(
 	api: GroundTruthItemOut,
 	providerId = "api",
 ): GroundTruthItem {
-	// Map history if present
-	let history: GroundTruthItem["history"];
-	if (api.history && api.history.length > 0) {
-		// History exists - use it as-is (don't overwrite with synthQuestion)
-		history = api.history.map((h) => ({
-			role: h.role === "assistant" ? "agent" : "user",
-			content: h.msg,
-			expectedBehavior:
-				h.expectedBehavior && h.expectedBehavior.length > 0
-					? h.expectedBehavior
-					: undefined,
-		}));
-	} else {
-		// ALWAYS convert single-turn items to multi-turn format
-		// Legacy single-turn item: create initial history from synthQuestion/editedQuestion
-		const initialQuestion = api.editedQuestion || api.synthQuestion || "";
-		if (initialQuestion) {
-			history = [
-				{
-					role: "user" as const,
-					content: initialQuestion,
-				},
-				{
-					role: "agent" as const,
-					content: api.answer || "", // Empty string if no answer
-				},
-			];
-		}
-	}
-
-	// For multi-turn items, use first user turn content; for single-turn items, use editedQuestion or synthQuestion
-	const question =
-		history && history.length > 0 && history[0].role === "user"
-			? history[0].content
-			: api.editedQuestion || api.synthQuestion || "";
-
-	// Map references from both top-level refs (single-turn) and history refs (multi-turn)
-	const refs: GroundTruthItem["references"] = [];
-
-	// Pre-calculate total reference count for better array allocation
-	const topLevelRefCount = api.refs?.length || 0;
-	const historyRefCount =
-		api.history?.reduce((sum, turn) => sum + (turn.refs?.length || 0), 0) || 0;
-	const totalRefCount = topLevelRefCount + historyRefCount;
-
-	// Pre-allocate array for better memory performance
-	if (totalRefCount > 0) {
-		refs.length = totalRefCount;
-		let refIndex = 0;
-
-		// Process top-level refs
-		if (api.refs?.length) {
-			for (let i = 0; i < api.refs.length; i++) {
-				const r = api.refs[i];
-				refs[refIndex] = {
-					id: `ref_${refIndex}`,
-					title: r.title || (r.url ? urlToTitle(r.url) : undefined),
-					url: r.url,
-					snippet: r.content ?? undefined,
-					keyParagraph: r.keyExcerpt ?? undefined,
-					visitedAt: null,
-					bonus: r.bonus === true,
-					messageIndex:
-						!api.history || api.history.length === 0 ? 1 : undefined,
-				};
-				refIndex++;
-			}
-		}
-
-		// Process history refs in single pass
-		if (api.history?.length) {
-			for (let turnIndex = 0; turnIndex < api.history.length; turnIndex++) {
-				const turn = api.history[turnIndex];
-				if (turn.refs?.length) {
-					for (let i = 0; i < turn.refs.length; i++) {
-						const r = turn.refs[i];
-						refs[refIndex] = {
-							id: `ref_${refIndex}`,
-							title: r.title || (r.url ? urlToTitle(r.url) : undefined),
-							url: r.url,
-							snippet: r.content ?? undefined,
-							keyParagraph: r.keyExcerpt ?? undefined,
-							visitedAt: null,
-							bonus: r.bonus === true,
-							messageIndex: turnIndex,
-						};
-						refIndex++;
-					}
-				}
-			}
-		}
-	}
-
-	const deleted = api.status === "deleted";
-	return {
-		id: api.id,
-		providerId,
-		question,
-		answer: api.answer ?? "",
-		history,
-		comment: api.comment ?? undefined,
-		references: refs,
-		status:
-			(deleted ? "draft" : (api.status as GroundTruthItem["status"])) ||
-			"draft",
-		deleted,
-		tags: api.tags || [],
-		manualTags: api.manualTags || [],
-		computedTags: api.computedTags || [],
-		datasetName: api.datasetName,
-		bucket: (api.bucket as string) || "0",
-		reviewedAt: api.reviewedAt ?? null,
-		totalReferences: api.totalReferences,
-		...({
-			_etag: api._etag,
-		} as Record<string, unknown>),
-	};
+	return groundTruthFromApi(api as ApiGroundTruth, providerId);
 }
 
 interface ListAllGroundTruthsParams {
@@ -154,6 +61,7 @@ interface ListAllGroundTruthsResult {
 
 export async function listAllGroundTruths(
 	params: ListAllGroundTruthsParams = {},
+	signal?: AbortSignal,
 ): Promise<ListAllGroundTruthsResult> {
 	const query: operations["list_all_ground_truths_v1_ground_truths_get"]["parameters"]["query"] =
 		{};
@@ -175,6 +83,7 @@ export async function listAllGroundTruths(
 		params: {
 			query: Object.keys(query).length ? query : undefined,
 		},
+		signal,
 	});
 	if (error) throw error;
 	const payload = (data as unknown as
@@ -192,10 +101,11 @@ export async function getGroundTruth(
 	datasetName: string,
 	bucket: string,
 	id: string,
+	signal?: AbortSignal,
 ): Promise<GroundTruthItem> {
 	const { data, error } = await client.GET(
 		"/v1/ground-truths/{datasetName}/{bucket}/{item_id}",
-		{ params: { path: { datasetName, bucket, item_id: id } } },
+		{ params: { path: { datasetName, bucket, item_id: id } }, signal },
 	);
 
 	if (error) {
diff --git a/frontend/src/services/http.ts b/frontend/src/services/http.ts
index e2f28f8..91fe06e 100644
--- a/frontend/src/services/http.ts
+++ b/frontend/src/services/http.ts
@@ -1,11 +1,28 @@
 /** HTTP helper utilities for backend API calls */
 
+export function normalizeAppBasePath(basePath: string | undefined): string {
+	if (!basePath) return "";
+	const trimmed = basePath.trim();
+	if (!trimmed || trimmed === "/") return "";
+	return `/${trimmed.replace(/^\/+|\/+$/g, "")}`;
+}
+
+export function getAppBasePath(): string {
+	return normalizeAppBasePath(import.meta.env.BASE_URL as string | undefined);
+}
+
+export function prefixAppBasePath(path: string): string {
+	if (!path.startsWith("/") || path.startsWith("//")) return path;
+	const basePath = getAppBasePath();
+	if (!basePath || path === basePath || path.startsWith(`${basePath}/`)) {
+		return path;
+	}
+	return `${basePath}${path}`;
+}
+
 export function getApiBaseUrl(): string {
-	// In the browser, use relative "/v1" so Vite dev proxy can intercept; in production
-	// VITE_API_BASE_URL can still be used if absolute URLs are needed in deployments.
-	// Prefer relative path for browser calls; backend is expected under /v1
-	// If you need absolute base for SSR or other contexts, revise as needed.
-	return "/v1";
+	// Keep browser calls same-origin, but honor an optional Vite base path like "/gtc".
+	return prefixAppBasePath("/v1");
 }
 
 export function withDevUser(init: RequestInit = {}): RequestInit {
diff --git a/frontend/src/services/runtimeConfig.ts b/frontend/src/services/runtimeConfig.ts
index 5cdca67..ad67302 100644
--- a/frontend/src/services/runtimeConfig.ts
+++ b/frontend/src/services/runtimeConfig.ts
@@ -5,12 +5,43 @@
  * that can be changed without rebuilding the frontend.
  */
 
+import { useSyncExternalStore } from "react";
 import type { components } from "../api/generated";
+import { getApiBaseUrl } from "./http";
 
-type RuntimeConfig = components["schemas"]["FrontendConfig"];
+export type RuntimeConfig = components["schemas"]["FrontendConfig"];
 
 let cachedConfig: RuntimeConfig | null = null;
 let configPromise: Promise<RuntimeConfig> | null = null;
+const listeners = new Set<() => void>();
+
+function notifyListeners() {
+	for (const listener of listeners) {
+		listener();
+	}
+}
+
+function setCachedConfig(config: RuntimeConfig) {
+	cachedConfig = config;
+	notifyListeners();
+}
+
+function buildFallbackConfig(): RuntimeConfig {
+	const trustedDomainsRaw =
+		(import.meta.env.VITE_TRUSTED_REFERENCE_DOMAINS as string | undefined) ??
+		"";
+	const trustedReferenceDomains = trustedDomainsRaw
+		.split(",")
+		.map((d) => d.trim().toLowerCase())
+		.filter(Boolean);
+
+	return {
+		requireReferenceVisit: getEnvBoolean("VITE_REQUIRE_REFERENCE_VISIT", true),
+		requireKeyParagraph: getEnvBoolean("VITE_REQUIRE_KEY_PARAGRAPH", false),
+		selfServeLimit: getEnvNumber("VITE_SELF_SERVE_LIMIT", 10),
+		trustedReferenceDomains,
+	};
+}
 
 /**
  * Fetch runtime configuration from backend.
@@ -31,10 +62,10 @@ export async function getRuntimeConfig(): Promise<RuntimeConfig> {
 	// Fetch config from backend
 	configPromise = (async () => {
 		try {
-			const response = await fetch("/v1/config");
+			const response = await fetch(`${getApiBaseUrl()}/config`);
 			if (response.ok) {
 				const config: RuntimeConfig = await response.json();
-				cachedConfig = config;
+				setCachedConfig(config);
 				return config;
 			}
 		} catch (error) {
@@ -44,25 +75,8 @@ export async function getRuntimeConfig(): Promise<RuntimeConfig> {
 			);
 		}
 
-		// Fallback to environment variables (for local dev)
-		const trustedDomainsRaw =
-			(import.meta.env.VITE_TRUSTED_REFERENCE_DOMAINS as string | undefined) ??
-			"";
-		const trustedReferenceDomains = trustedDomainsRaw
-			.split(",")
-			.map((d) => d.trim().toLowerCase())
-			.filter(Boolean);
-
-		const fallbackConfig: RuntimeConfig = {
-			requireReferenceVisit: getEnvBoolean(
-				"VITE_REQUIRE_REFERENCE_VISIT",
-				true,
-			),
-			requireKeyParagraph: getEnvBoolean("VITE_REQUIRE_KEY_PARAGRAPH", false),
-			selfServeLimit: getEnvNumber("VITE_SELF_SERVE_LIMIT", 10),
-			trustedReferenceDomains,
-		};
-		cachedConfig = fallbackConfig;
+		const fallbackConfig = buildFallbackConfig();
+		setCachedConfig(fallbackConfig);
 		return fallbackConfig;
 	})();
 
@@ -90,10 +104,29 @@ function getEnvNumber(key: string, defaultValue: number): number {
 	return Number.isNaN(parsed) ? defaultValue : parsed;
 }
 
+export function subscribeToRuntimeConfig(listener: () => void) {
+	listeners.add(listener);
+	return () => {
+		listeners.delete(listener);
+	};
+}
+
+export function getRuntimeConfigSnapshot(): RuntimeConfig | null {
+	return cachedConfig;
+}
+
 /**
  * Synchronously get cached config (must call getRuntimeConfig first).
  * Returns null if config not yet loaded.
  */
 export function getCachedConfig(): RuntimeConfig | null {
-	return cachedConfig;
+	return getRuntimeConfigSnapshot();
+}
+
+export function useRuntimeConfig(): RuntimeConfig | null {
+	return useSyncExternalStore(
+		subscribeToRuntimeConfig,
+		getRuntimeConfigSnapshot,
+		getRuntimeConfigSnapshot,
+	);
 }
diff --git a/frontend/src/services/search.ts b/frontend/src/services/search.ts
index 0ad5b16..b7b002e 100644
--- a/frontend/src/services/search.ts
+++ b/frontend/src/services/search.ts
@@ -61,11 +61,13 @@ function mapWireToReference(x: SearchResultWire): Reference | null {
 export async function searchReferences(
 	query: string,
 	top = 10,
+	signal?: AbortSignal,
 ): Promise<Reference[]> {
 	const q = query.trim();
 	if (!q) return [];
 	const { data, error } = await client.GET("/v1/search", {
 		params: { query: { q, top } },
+		signal,
 	});
 	if (error) throw error;
 	let arrUnknown: unknown[] = [];
@@ -94,21 +96,85 @@ export async function searchReferences(
 	return mapped;
 }
 
+function createAbortError(): DOMException {
+	return new DOMException("The operation was aborted.", "AbortError");
+}
+
+function abortableDelay(ms: number, signal?: AbortSignal): Promise<void> {
+	if (signal?.aborted) {
+		return Promise.reject(createAbortError());
+	}
+
+	return new Promise((resolve, reject) => {
+		const onAbort = () => {
+			window.clearTimeout(timeoutId);
+			reject(createAbortError());
+		};
+		const timeoutId = window.setTimeout(() => {
+			signal?.removeEventListener("abort", onAbort);
+			resolve();
+		}, ms);
+
+		signal?.addEventListener("abort", onAbort, { once: true });
+	});
+}
+
 // Mock for demo mode only
-export async function mockAiSearch(query: string): Promise<Reference[]> {
-	await new Promise((r) => setTimeout(r, 500));
-	const base = `https://example.com/product/${encodeURIComponent(
-		query.toLowerCase().replace(/\s+/g, "-"),
-	)}`;
-	const mk = (n: number): Reference => ({
+export async function mockAiSearch(
+	query: string,
+	signal?: AbortSignal,
+): Promise<Reference[]> {
+	await abortableDelay(500, signal);
+	const normalized = query.trim().toLowerCase();
+	const catalog = [
+		{
+			slug: "data-usage-check-usage",
+			title: "Check mobile data usage",
+			snippet:
+				"Compare current-cycle usage to the plan cap before treating a spike as a defect.",
+		},
+		{
+			slug: "wifi-assist",
+			title: "Reduce cellular usage with Wi-Fi",
+			snippet:
+				"Streaming and tethering over cellular are common causes of data overage charges.",
+		},
+		{
+			slug: "travel-pass-timing",
+			title: "Travel pass activation timing",
+			snippet:
+				"Roaming passes only apply after activation and do not retroactively cover earlier sessions.",
+		},
+		{
+			slug: "sim-swap-refresh",
+			title: "Refresh service after SIM swap",
+			snippet:
+				"If feature entitlements lag a SIM swap, run a targeted refresh before escalating.",
+		},
+		{
+			slug: "event-congestion",
+			title: "Understand temporary event congestion",
+			snippet:
+				"Large venues can saturate nearby sectors briefly without indicating a persistent outage.",
+		},
+	];
+
+	const ranked = catalog
+		.filter((entry) => {
+			if (!normalized) return true;
+			const haystack = `${entry.title} ${entry.snippet}`.toLowerCase();
+			return haystack.includes(normalized);
+		})
+		.slice(0, 5);
+
+	const res = (ranked.length ? ranked : catalog.slice(0, 5)).map((entry) => ({
 		id: randId("ref"),
-		title: `${query} – Result ${n}`,
-		url: `${base}-${n}`,
-		snippet: `Relevant snippet ${n} for ${query}. Mentions key commands, options, and caveats...`,
+		title: entry.title,
+		url: `https://telco.example.com/help/${entry.slug}`,
+		snippet: entry.snippet,
 		visitedAt: null,
 		keyParagraph: "",
-	});
-	const res = [mk(1), mk(2), mk(3), mk(4), mk(5)];
+	}));
 	try {
 		logEvent("gtc.search", {
 			queryLen: query.trim().length,
diff --git a/frontend/src/services/stats.ts b/frontend/src/services/stats.ts
index d905c29..331a91b 100644
--- a/frontend/src/services/stats.ts
+++ b/frontend/src/services/stats.ts
@@ -51,11 +51,11 @@ export async function getGroundTruthStats(): Promise<StatsPayload> {
 export async function mockGetGroundTruthStats(): Promise<StatsPayload> {
 	await new Promise((r) => setTimeout(r, 150));
 	return {
-		total: { approved: 12, draft: 7, deleted: 2 },
+		total: { approved: 9, draft: 6, deleted: 2 },
 		perSprint: [
-			{ sprint: "Sprint 24.7", approved: 3, draft: 1, deleted: 0 },
-			{ sprint: "Sprint 24.8", approved: 5, draft: 2, deleted: 1 },
-			{ sprint: "Sprint 24.9", approved: 4, draft: 4, deleted: 1 },
+			{ sprint: "Trace Batch 26.1", approved: 2, draft: 1, deleted: 0 },
+			{ sprint: "Trace Batch 26.2", approved: 4, draft: 3, deleted: 1 },
+			{ sprint: "Trace Batch 26.3", approved: 3, draft: 2, deleted: 1 },
 		],
 	};
 }
diff --git a/frontend/src/services/tags.ts b/frontend/src/services/tags.ts
index 7ffe582..225e28e 100644
--- a/frontend/src/services/tags.ts
+++ b/frontend/src/services/tags.ts
@@ -2,18 +2,102 @@ import { client } from "../api/client";
 import type { components } from "../api/generated";
 
 type TagSchema = components["schemas"]["TagSchemaResponse"];
+type GlossaryResponse = components["schemas"]["GlossaryResponse"];
 
 /** Response structure with separate manual and computed tags */
-interface TagsWithComputed {
+export interface TagsWithComputed {
 	manualTags: string[];
 	computedTags: string[];
 }
 
-/**
- * Fetch tag schema from backend with retry logic.
- * Returns null if unable to fetch after retries.
- */
-export async function fetchTagSchema(): Promise<TagSchema | null> {
+interface TagMetadataSnapshot extends TagsWithComputed {
+	allTags: string[];
+	loading: boolean;
+	error: Error | null;
+}
+
+export interface TagGlossary {
+	[tagKey: string]: string | undefined;
+}
+
+const EMPTY_TAGS: TagsWithComputed = {
+	manualTags: [],
+	computedTags: [],
+};
+
+let tagsCache: TagsWithComputed | null = null;
+let tagsInFlight: Promise<TagsWithComputed> | null = null;
+let tagsLoading = false;
+let tagsError: Error | null = null;
+const tagListeners = new Set<() => void>();
+let tagSnapshot: TagMetadataSnapshot = buildTagSnapshot();
+
+let schemaCache: TagSchema | null = null;
+let schemaInFlight: Promise<TagSchema | null> | null = null;
+
+let glossaryCache: GlossaryResponse | null = null;
+let glossaryInFlight: Promise<GlossaryResponse> | null = null;
+
+function normalizeError(error: unknown, fallback: string): Error {
+	if (error instanceof Error) {
+		return error;
+	}
+
+	if (typeof error === "string" && error.trim()) {
+		return new Error(error);
+	}
+
+	return new Error(fallback);
+}
+
+function sortTags(tags: string[]): string[] {
+	return [...tags].sort((a, b) => a.localeCompare(b));
+}
+
+function normalizeTags(response?: {
+	tags?: string[];
+	computedTags?: string[];
+}): TagsWithComputed {
+	return {
+		manualTags: sortTags(response?.tags ?? []),
+		computedTags: sortTags(response?.computedTags ?? []),
+	};
+}
+
+function getAllTags(tags: TagsWithComputed): string[] {
+	return sortTags([...new Set([...tags.manualTags, ...tags.computedTags])]);
+}
+
+function buildTagSnapshot(): TagMetadataSnapshot {
+	const cachedTags = tagsCache ?? EMPTY_TAGS;
+	return {
+		...cachedTags,
+		allTags: getAllTags(cachedTags),
+		loading: tagsLoading,
+		error: tagsError,
+	};
+}
+
+function updateTagSnapshot() {
+	tagSnapshot = buildTagSnapshot();
+	for (const listener of tagListeners) {
+		listener();
+	}
+}
+
+async function requestTagsWithComputed(): Promise<TagsWithComputed> {
+	const { data, error } = await client.GET("/v1/tags", {});
+	if (error) {
+		throw normalizeError(error, "Failed to fetch tags");
+	}
+
+	const response = data as
+		| { tags?: string[]; computedTags?: string[] }
+		| undefined;
+	return normalizeTags(response);
+}
+
+async function requestTagSchemaWithRetry(): Promise<TagSchema | null> {
 	const maxRetries = 3;
 	const initialRetryDelayMs = 200;
 
@@ -21,39 +105,150 @@ export async function fetchTagSchema(): Promise<TagSchema | null> {
 		try {
 			const { data, error } = await client.GET("/v1/tags/schema", {});
 			if (error) {
-				if (attempt === maxRetries) {
-					console.warn("Failed to fetch tag schema after retries:", error);
-					return null;
-				}
-				const delay = initialRetryDelayMs * 2 ** (attempt - 1);
-				await new Promise((resolve) => setTimeout(resolve, delay));
-				continue;
+				throw error;
 			}
+
 			return data as TagSchema;
-		} catch (err) {
+		} catch (error) {
 			if (attempt === maxRetries) {
-				console.warn("Failed to fetch tag schema after retries:", err);
+				console.warn("Failed to fetch tag schema after retries:", error);
 				return null;
 			}
+
 			const delay = initialRetryDelayMs * 2 ** (attempt - 1);
 			await new Promise((resolve) => setTimeout(resolve, delay));
 		}
 	}
+
 	return null;
 }
 
+export function subscribeToTagMetadata(listener: () => void) {
+	tagListeners.add(listener);
+	return () => {
+		tagListeners.delete(listener);
+	};
+}
+
+export function getTagMetadataSnapshot(): TagMetadataSnapshot {
+	return tagSnapshot;
+}
+
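+/**
+ * Fetch tags as separate manual and computed arrays (GET /v1/tags returns { tags: [...], computedTags: [...] }).
+ * Cached unless `force` is set; on failure, resolves to the cached (or empty) tags and records the error in the snapshot.
+ */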
+export async function fetchTagsWithComputed(options?: {
+	force?: boolean;
+}): Promise<TagsWithComputed> {
+	const force = options?.force ?? false;
+
+	if (!force && tagsCache) {
+		return tagsCache;
+	}
+
+	if (tagsInFlight) {
+		return tagsInFlight;
+	}
+
+	tagsLoading = true;
+	tagsError = null;
+	updateTagSnapshot();
+
+	tagsInFlight = requestTagsWithComputed()
+		.then((tags) => {
+			tagsCache = tags;
+			tagsError = null;
+			return tags;
+		})
+		.catch((error) => {
+			tagsError = normalizeError(error, "Failed to fetch tags");
+			return tagsCache ?? EMPTY_TAGS;
+		})
+		.finally(() => {
+			tagsLoading = false;
+			tagsInFlight = null;
+			updateTagSnapshot();
+		});
+
+	return tagsInFlight;
+}
+
+export function ensureTagCached(tag: string) {
+	const normalizedTag = tag.trim();
+	if (!normalizedTag) {
+		return;
+	}
+
+	const current = tagsCache ?? EMPTY_TAGS;
+	if (
+		current.manualTags.includes(normalizedTag) ||
+		current.computedTags.includes(normalizedTag)
+	) {
+		return;
+	}
+
+	tagsCache = {
+		manualTags: sortTags([...current.manualTags, normalizedTag]),
+		computedTags: current.computedTags,
+	};
+	tagsError = null;
+	updateTagSnapshot();
+}
+
+/**
+ * Fetch all available tags (manual + computed merged).
+ * Kept for backward compatibility with existing callers.
+ */
+export async function fetchAvailableTags(options?: {
+	force?: boolean;
+}): Promise<string[]> {
+	const tags = await fetchTagsWithComputed(options);
+	return getAllTags(tags);
+}
+
+/**
+ * Fetch the tag schema from the backend with retry logic; successful results are cached.
+ * Returns null if unable to fetch after retries.
+ */
+export async function fetchTagSchema(options?: {
+	force?: boolean;
+}): Promise<TagSchema | null> {
+	const force = options?.force ?? false;
+
+	if (!force && schemaCache) {
+		return schemaCache;
+	}
+
+	if (schemaInFlight) {
+		return schemaInFlight;
+	}
+
+	schemaInFlight = requestTagSchemaWithRetry()
+		.then((schema) => {
+			if (schema) {
+				schemaCache = schema;
+			}
+			return schema;
+		})
+		.finally(() => {
+			schemaInFlight = null;
+		});
+
+	return schemaInFlight;
+}
+
 /**
  * Get the set of exclusive tag group names.
  * Fetches schema on-demand if not already loaded.
  */
 async function getExclusiveGroups(): Promise<Set<string>> {
-	// Try to ensure schema is loaded
 	const schema = await fetchTagSchema();
 
 	if (schema?.groups) {
 		const exclusiveGroups = schema.groups
-			.filter((g) => g.exclusive)
-			.map((g) => g.name);
+			.filter((group) => group.exclusive)
+			.map((group) => group.name);
 		return new Set(exclusiveGroups);
 	}
 
@@ -96,53 +291,80 @@ export async function validateExclusiveTags(
 	return null;
 }
 
-/**
- * Fetch tags with separate manual and computed arrays.
- * GET /v1/tags now returns { tags: [...], computedTags: [...] }
- */
-export async function fetchTagsWithComputed(): Promise<TagsWithComputed> {
-	try {
-		const { data, error } = await client.GET("/v1/tags", {});
-		if (error) throw error;
-		const response = data as unknown as
-			| { tags?: string[]; computedTags?: string[] }
-			| undefined;
-		const manualTags = [...(response?.tags ?? [])].sort((a, b) =>
-			a.localeCompare(b),
-		);
-		const computedTags = [...(response?.computedTags ?? [])].sort((a, b) =>
-			a.localeCompare(b),
-		);
-		return { manualTags, computedTags };
-	} catch {
-		// No-op; return empty
-	}
-	return { manualTags: [], computedTags: [] };
+export async function fetchTagGlossary(options?: {
+	force?: boolean;
+}): Promise<GlossaryResponse> {
+	const force = options?.force ?? false;
+
+	if (!force && glossaryCache) {
+		return glossaryCache;
+	}
+
+	if (glossaryInFlight) {
+		return glossaryInFlight;
+	}
+
+	glossaryInFlight = client
+		.GET("/v1/tags/glossary", {})
+		.then(({ data, error }) => {
+			if (error) {
+				throw normalizeError(error, "Failed to fetch tag glossary");
+			}
+
+			const glossary = data as GlossaryResponse;
+			glossaryCache = glossary;
+			return glossary;
+		})
+		.finally(() => {
+			glossaryInFlight = null;
+		});
+
+	return glossaryInFlight;
 }
 
-/**
- * Fetch all available tags (manual + computed merged).
- * For backward compatibility with existing callers.
- */
-export async function fetchAvailableTags(): Promise<string[]> {
-	const { manualTags, computedTags } = await fetchTagsWithComputed();
-	const merged = [...new Set([...manualTags, ...computedTags])];
-	return merged.sort((a, b) => a.localeCompare(b));
+export function buildTagGlossaryMap(glossary: GlossaryResponse): TagGlossary {
+	const glossaryMap: TagGlossary = {};
+
+	for (const group of glossary.groups || []) {
+		for (const tag of group.tags || []) {
+			if (tag.key && tag.description) {
+				glossaryMap[tag.key] = tag.description;
+			}
+		}
+	}
+
+	return glossaryMap;
+}
+
+export function clearTagGlossaryCache() {
+	glossaryCache = null;
+	glossaryInFlight = null;
 }
 
 /** Add tags to the global tags collection. Returns the updated list. */
 export async function addTags(tags: string[]): Promise<string[]> {
-	const unique = Array.from(new Set(tags.map((t) => t.trim()).filter(Boolean)));
+	const unique = Array.from(
+		new Set(tags.map((tag) => tag.trim()).filter(Boolean)),
+	);
 	if (!unique.length) return [];
+
 	const { data, error } = await client.POST("/v1/tags", {
 		body: {
 			tags: unique,
 		} as unknown as components["schemas"]["AddTagsRequest"],
 	});
 	if (error) throw error;
+
 	const response = data as unknown as { tags?: string[] } | undefined;
-	const list = response?.tags || [];
-	return [...list].sort((a, b) => a.localeCompare(b));
+	const manualTags = sortTags(response?.tags || []);
+	tagsCache = {
+		manualTags,
+		computedTags: tagsCache?.computedTags ?? [],
+	};
+	tagsError = null;
+	updateTagSnapshot();
+
+	return manualTags;
 }
 
 /** Create or update a custom tag definition */
@@ -157,6 +379,7 @@ export async function createTagDefinition(
 		} as unknown as components["schemas"]["TagDefinitionRequest"],
 	});
 	if (error) throw error;
+	clearTagGlossaryCache();
 }
 
 /** Delete a custom tag definition */
@@ -165,4 +388,5 @@ export async function deleteTagDefinition(tagKey: string): Promise<void> {
 		params: { path: { tag_key: tagKey } },
 	});
 	if (error) throw error;
+	clearTagGlossaryCache();
 }
diff --git a/frontend/src/types/filters.ts b/frontend/src/types/filters.ts
index 7444506..59594fa 100644
--- a/frontend/src/types/filters.ts
+++ b/frontend/src/types/filters.ts
@@ -8,6 +8,7 @@ export type SortColumn =
 	| "reviewedAt"
 	| "hasAnswer"
 	| "tagCount"
+	| "toolCallCount"
 	| null;
 export type SortDirection = "asc" | "desc";
 
diff --git a/frontend/tests/e2e/curation-flow.spec.ts b/frontend/tests/e2e/curation-flow.spec.ts
new file mode 100644
index 0000000..209f76a
--- /dev/null
+++ b/frontend/tests/e2e/curation-flow.spec.ts
@@ -0,0 +1,100 @@
+import { expect, type Page, test } from "@playwright/test";
+import {
+	datasetNameForRun,
+	itemIdForDataset,
+	openExplorerAndFilter,
+	seedDeterministicItem,
+} from "./helpers";
+
+const TOOL_NAME = "search_docs";
+
+async function editTurn(page: Page, index: number, nextContent: string) {
+	const turn = page.locator(`[data-turn-index="${index}"]`).first();
+	await turn.getByRole("button", { name: "Edit" }).click();
+	await turn.locator("textarea").fill(nextContent);
+	await turn.getByRole("button", { name: "Save" }).click();
+	await expect(turn).toContainText(nextContent);
+}
+
+async function setToolCallDecision(page: Page, label: RegExp) {
+	const relevanceGroup = page
+		.getByRole("radiogroup", {
+			name: "Tool call relevance",
+		})
+		.first();
+	await relevanceGroup.getByRole("button", { name: label }).click();
+}
+
+test("curator flow persists edits and shows approved item in Explorer", async ({
+	page,
+}) => {
+	const datasetName = datasetNameForRun();
+	const itemId = itemIdForDataset(datasetName);
+	const editedUserMessage = "Edited user message from the first Playwright E2E";
+	const editedAgentMessage =
+		"Edited agent response persisted through the real backend";
+	const editedComment = "Persisted curator note from Playwright E2E";
+
+	await seedDeterministicItem(datasetName, itemId, TOOL_NAME);
+
+	await page.goto("/");
+	await expect(page.getByText("Ground Truth Curator")).toBeVisible();
+
+	await openExplorerAndFilter(page, datasetName, itemId);
+	await expect(
+		page.getByRole("button", { name: `Assign ${itemId}` }),
+	).toBeVisible();
+
+	await page.getByRole("button", { name: `Assign ${itemId}` }).click();
+	await expect(page.locator('[data-turn-index="0"]').first()).toContainText(
+		"Original seeded user message",
+	);
+
+	await editTurn(page, 0, editedUserMessage);
+	await editTurn(page, 1, editedAgentMessage);
+
+	await page
+		.getByRole("button", { name: `Toggle tool call ${TOOL_NAME}` })
+		.first()
+		.click();
+	await setToolCallDecision(page, /Not needed/);
+	await setToolCallDecision(page, /Optional/);
+	await setToolCallDecision(page, /Required/);
+	await expect(
+		page.getByRole("button", { name: /Required/ }).first(),
+	).toHaveAttribute("aria-pressed", "true");
+
+	await page.getByRole("textbox", { name: "Comments" }).fill(editedComment);
+	await page.getByRole("button", { name: "Save Draft" }).click();
+	await expect(page.getByText(`Saved ${itemId} – draft`)).toBeVisible();
+
+	await page.reload();
+	await page.getByRole("option", { name: new RegExp(itemId) }).click();
+	await expect(page.locator('[data-turn-index="0"]').first()).toContainText(
+		editedUserMessage,
+	);
+	await expect(page.locator('[data-turn-index="1"]').first()).toContainText(
+		editedAgentMessage,
+	);
+
+	await page
+		.getByRole("button", { name: `Toggle tool call ${TOOL_NAME}` })
+		.first()
+		.click();
+	await expect(
+		page.getByRole("button", { name: /Required/ }).first(),
+	).toHaveAttribute("aria-pressed", "true");
+	await expect(page.getByRole("textbox", { name: "Comments" })).toHaveValue(
+		editedComment,
+	);
+
+	await page.getByRole("button", { name: "Approve" }).click();
+	await expect(page.getByText(`Saved ${itemId} – approved`)).toBeVisible();
+
+	await openExplorerAndFilter(page, datasetName, itemId, "approved");
+	const approvedRow = page
+		.locator("tbody tr")
+		.filter({ hasText: itemId })
+		.first();
+	await expect(approvedRow).toContainText("approved");
+});
diff --git a/frontend/tests/e2e/helpers.ts b/frontend/tests/e2e/helpers.ts
new file mode 100644
index 0000000..502c20c
--- /dev/null
+++ b/frontend/tests/e2e/helpers.ts
@@ -0,0 +1,115 @@
+import { expect, type Page } from "@playwright/test";
+
+const BUCKET = "00000000-0000-0000-0000-000000000000";
+const DEV_USER =
+	process.env.PLAYWRIGHT_DEV_USER ?? "playwright-e2e@example.com";
+const BACKEND_URL =
+	process.env.PLAYWRIGHT_BACKEND_URL ?? "http://127.0.0.1:8010";
+
+export function datasetNameForRun() {
+	return `playwright-e2e-${Date.now()}`;
+}
+
+export function itemIdForDataset(datasetName: string) {
+	return `${datasetName}-item`;
+}
+
+export async function seedDeterministicItem(
+	datasetName: string,
+	itemId: string,
+	toolName = "search_docs",
+) {
+	const response = await fetch(`${BACKEND_URL}/v1/ground-truths`, {
+		method: "POST",
+		headers: {
+			"Content-Type": "application/json",
+			"X-User-Id": DEV_USER,
+		},
+		body: JSON.stringify([
+			{
+				id: itemId,
+				datasetName,
+				bucket: BUCKET,
+				status: "draft",
+				comment: "Seeded by Playwright E2E",
+				history: [
+					{
+						role: "user",
+						msg: "Original seeded user message",
+					},
+					{
+						role: "assistant",
+						msg: "Original seeded agent response",
+					},
+				],
+				toolCalls: [
+					{
+						id: "tool-call-1",
+						name: toolName,
+						callType: "tool",
+						stepNumber: 1,
+						arguments: {
+							query: "ground truth curator",
+						},
+						response: {
+							hits: ["Ground Truth Curator docs"],
+						},
+					},
+				],
+				expectedTools: {
+					required: [],
+					optional: [{ name: toolName }],
+					notNeeded: [],
+				},
+				metadata: {
+					seededBy: "playwright",
+				},
+				scenarioId: "playwright-real-e2e",
+			},
+		]),
+	});
+
+	if (!response.ok) {
+		throw new Error(`Seed failed: ${response.status} ${await response.text()}`);
+	}
+
+	const body = await response.json();
+	expect(body.imported).toBe(1);
+	expect(body.uuids).toContain(itemId);
+}
+
+export async function filterExplorerResults(
+	page: Page,
+	datasetName: string,
+	itemId: string,
+	status: "all" | "approved" = "all",
+) {
+	await expect(page.getByText("Explore all ground truths")).toBeVisible();
+
+	const datasetSelect = page.getByLabel("Dataset:");
+	await expect
+		.poll(async () => {
+			const options = await datasetSelect.locator("option").allTextContents();
+			return options.includes(datasetName);
+		})
+		.toBe(true);
+	await datasetSelect.selectOption(datasetName);
+
+	await page.getByLabel("Item ID:").fill(itemId);
+
+	if (status === "approved") {
+		await page.getByRole("button", { name: "Approved" }).click();
+	}
+
+	await page.getByRole("button", { name: "Apply Filters" }).click();
+}
+
+export async function openExplorerAndFilter(
+	page: Page,
+	datasetName: string,
+	itemId: string,
+	status: "all" | "approved" = "all",
+) {
+	await page.getByRole("button", { name: "Explorer" }).click();
+	await filterExplorerResults(page, datasetName, itemId, status);
+}
diff --git a/frontend/tests/e2e/self-serve-assignment-flow.spec.ts b/frontend/tests/e2e/self-serve-assignment-flow.spec.ts
new file mode 100644
index 0000000..12146eb
--- /dev/null
+++ b/frontend/tests/e2e/self-serve-assignment-flow.spec.ts
@@ -0,0 +1,81 @@
+import { expect, type Page, test } from "@playwright/test";
+import {
+	datasetNameForRun,
+	itemIdForDataset,
+	seedDeterministicItem,
+} from "./helpers";
+
+function escapeRegExp(value: string) {
+	return value.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
+}
+
+async function editTurn(page: Page, index: number, nextContent: string) {
+	const turn = page.locator(`[data-turn-index="${index}"]`).first();
+	await turn.getByRole("button", { name: "Edit" }).click();
+	await turn.locator("textarea").fill(nextContent);
+	await turn.getByRole("button", { name: "Save" }).click();
+	await expect(turn).toContainText(nextContent);
+}
+
+test("self-serve assignment flow persists saved content after reload", async ({
+	page,
+}) => {
+	const datasetName = datasetNameForRun();
+	const itemId = itemIdForDataset(datasetName);
+	const editedUserMessage =
+		"Edited user message from the self-serve Playwright E2E";
+	const editedComment = "Persisted self-serve curator note from Playwright E2E";
+
+	await seedDeterministicItem(datasetName, itemId);
+	await page.route("**/v1/config", async (route) => {
+		await route.fulfill({
+			status: 200,
+			contentType: "application/json",
+			body: JSON.stringify({
+				requireReferenceVisit: true,
+				requireKeyParagraph: false,
+				selfServeLimit: 500,
+				trustedReferenceDomains: [],
+			}),
+		});
+	});
+
+	await page.goto("/");
+	await expect(page.getByText("Ground Truth Curator")).toBeVisible();
+
+	const queue = page.getByRole("listbox", { name: "Queue" });
+	await page.getByRole("button", { name: "Request More (Self‑serve)" }).click();
+	await expect(page.getByText(/Assigned \d+ item\(s\) to you\./)).toBeVisible();
+
+	await page.getByRole("button", { name: "Refresh" }).click();
+	await expect(page.getByText("Refreshed queue.")).toBeVisible();
+
+	const queueItem = queue.getByRole("option", { name: new RegExp(itemId) });
+	await expect(queueItem).toBeVisible();
+	await queueItem.click();
+	await expect(queueItem).toHaveAttribute("aria-selected", "true");
+	await expect(page.locator('[data-turn-index="0"]').first()).toContainText(
+		"Original seeded user message",
+	);
+
+	await editTurn(page, 0, editedUserMessage);
+	await page.getByRole("textbox", { name: "Comments" }).fill(editedComment);
+	await page.getByRole("button", { name: "Save Draft" }).click();
+	await expect(
+		page.getByText(new RegExp(`^Saved ${escapeRegExp(itemId)} – draft$`)),
+	).toBeVisible();
+
+	await page.reload();
+	const reloadedQueueItem = page.getByRole("option", {
+		name: new RegExp(`^${escapeRegExp(itemId)}\\b`),
+	});
+	await expect(reloadedQueueItem).toBeVisible();
+	await reloadedQueueItem.click();
+	await expect(reloadedQueueItem).toHaveAttribute("aria-selected", "true");
+	await expect(page.locator('[data-turn-index="0"]').first()).toContainText(
+		editedUserMessage,
+	);
+	await expect(page.getByRole("textbox", { name: "Comments" })).toHaveValue(
+		editedComment,
+	);
+});
diff --git a/frontend/tests/e2e/tag-management-flow.spec.ts b/frontend/tests/e2e/tag-management-flow.spec.ts
new file mode 100644
index 0000000..7752a91
--- /dev/null
+++ b/frontend/tests/e2e/tag-management-flow.spec.ts
@@ -0,0 +1,96 @@
+import { expect, test } from "@playwright/test";
+import {
+	datasetNameForRun,
+	itemIdForDataset,
+	openExplorerAndFilter,
+	seedDeterministicItem,
+} from "./helpers";
+
+test("tag management persists manual tags and filters Explorer results", async ({
+	page,
+}) => {
+	const datasetName = datasetNameForRun();
+	const itemId = itemIdForDataset(datasetName);
+	const manualTagValue = "playwright-e2e-tag-management";
+	const normalizedManualTag = `custom:${manualTagValue}`;
+
+	await seedDeterministicItem(datasetName, itemId);
+
+	await page.goto("/");
+	await expect(page.getByText("Ground Truth Curator")).toBeVisible();
+
+	await page.getByRole("button", { name: /Glossary/i }).click();
+	await expect(
+		page.getByRole("heading", { name: "Tag Glossary" }),
+	).toBeVisible();
+	await page.keyboard.press("Escape");
+	await expect(page.getByRole("heading", { name: "Tag Glossary" })).toHaveCount(
+		0,
+	);
+
+	await openExplorerAndFilter(page, datasetName, itemId);
+	await page.getByRole("button", { name: `Assign ${itemId}` }).click();
+	await expect(page.locator('[data-turn-index="0"]').first()).toContainText(
+		"Original seeded user message",
+	);
+
+	await page.getByRole("button", { name: "Manage Tags" }).click();
+	const tagsDialog = page.getByRole("dialog", { name: "Manage Tags" });
+	await expect(tagsDialog.getByText("Ground Truth Level Tags")).toBeVisible();
+	await tagsDialog.getByPlaceholder("Enter tag name...").fill(manualTagValue);
+	await tagsDialog.getByRole("button", { name: "Add" }).click();
+	await tagsDialog.getByRole("button", { name: "Done" }).click();
+
+	await expect(
+		page.getByText(normalizedManualTag, { exact: true }).first(),
+	).toBeVisible();
+
+	await page.getByRole("button", { name: "Save Draft" }).click();
+	await expect(page.getByText(`Saved ${itemId} – draft`)).toBeVisible();
+
+	await page.reload();
+	await page.getByRole("option", { name: new RegExp(itemId) }).click();
+	await expect(page.locator('[data-turn-index="0"]').first()).toContainText(
+		"Original seeded user message",
+	);
+	await expect(
+		page.getByText(normalizedManualTag, { exact: true }).first(),
+	).toBeVisible();
+
+	await page.getByRole("button", { name: "Explorer" }).click();
+	await expect(page.getByText("Explore all ground truths")).toBeVisible();
+
+	const referenceUrlFilter = page.getByLabel("Reference URL:");
+	await expect(referenceUrlFilter).toBeVisible();
+	await referenceUrlFilter.fill("https://example.com/reference");
+	await expect(referenceUrlFilter).toHaveValue("https://example.com/reference");
+	await referenceUrlFilter.clear();
+
+	const keywordFilter = page.getByLabel("Keyword Search:");
+	await expect(keywordFilter).toBeVisible();
+	await keywordFilter.fill(manualTagValue);
+	await expect(keywordFilter).toHaveValue(manualTagValue);
+	await keywordFilter.clear();
+
+	await page
+		.getByRole("button", { name: /Expand tag filters|Collapse tag filters/ })
+		.click();
+	const manualTagFilter = page
+		.locator("button")
+		.filter({ hasText: normalizedManualTag })
+		.first();
+	await manualTagFilter.click();
+	await expect(manualTagFilter).toContainText("✓");
+	await expect(
+		page.getByText(`Including tag: ${normalizedManualTag}`),
+	).toBeVisible();
+	await page.getByRole("button", { name: "Apply Filters" }).click();
+
+	const filteredRow = page
+		.locator("tbody tr")
+		.filter({ hasText: itemId })
+		.first();
+	await expect(filteredRow).toBeVisible();
+	await filteredRow.getByRole("button", { name: "Expand tags" }).click();
+	await expect(filteredRow).toContainText(normalizedManualTag);
+});
diff --git a/frontend/tests/test-helpers.ts b/frontend/tests/test-helpers.ts
new file mode 100644
index 0000000..a6f1623
--- /dev/null
+++ b/frontend/tests/test-helpers.ts
@@ -0,0 +1,29 @@
+/**
+ * Test data helpers for creating GroundTruthItem fixtures.
+ * After Phase 6: canonical state is history[]; question/answer are derived.
+ */
+import type { GroundTruthItem } from "../src/models/groundTruth";
+
+export function makeTestItem(
+	overrides: Partial<GroundTruthItem> & {
+		/** Shorthand: creates a two-turn conversation from question/answer strings */
+		simpleQA?: { question: string; answer: string };
+	} = {},
+): GroundTruthItem {
+	const { simpleQA, ...rest } = overrides;
+
+	const baseHistory = simpleQA
+		? [
+				{ role: "user", content: simpleQA.question, turnId: "turn_1" },
+				{ role: "agent", content: simpleQA.answer, turnId: "turn_2" },
+			]
+		: [];
+
+	return {
+		id: "test-item",
+		providerId: "test",
+		status: "draft",
+		history: baseHistory,
+		...rest,
+	};
+}
diff --git a/frontend/tests/unit/adapters/apiMapper.test.ts b/frontend/tests/unit/adapters/apiMapper.test.ts
index fbfa249..7731f6b 100644
--- a/frontend/tests/unit/adapters/apiMapper.test.ts
+++ b/frontend/tests/unit/adapters/apiMapper.test.ts
@@ -5,6 +5,7 @@ import {
 	groundTruthToPatch,
 } from "../../../src/adapters/apiMapper";
 import type { GroundTruthItem } from "../../../src/models/groundTruth";
+import { getItemReferences } from "../../../src/models/groundTruth";
 
 function makeApiItem(overrides: Partial<ApiGroundTruth> = {}): ApiGroundTruth {
 	return {
@@ -108,7 +109,8 @@ describe("groundTruthFromApi", () => {
 			const result = groundTruthFromApi(api);
 
 			// Refs from history[1] should have messageIndex 1
-			const refsAt1 = result.references.filter((r) => r.messageIndex === 1);
+			const allRefs = getItemReferences(result);
+			const refsAt1 = allRefs.filter((r) => r.messageIndex === 1);
 			expect(refsAt1).toHaveLength(2);
 			expect(refsAt1.map((r) => r.url)).toEqual([
 				"https://ref1.com",
@@ -116,7 +118,7 @@ describe("groundTruthFromApi", () => {
 			]);
 
 			// Refs from history[3] should have messageIndex 3
-			const refsAt3 = result.references.filter((r) => r.messageIndex === 3);
+			const refsAt3 = allRefs.filter((r) => r.messageIndex === 3);
 			expect(refsAt3).toHaveLength(1);
 			expect(refsAt3[0].url).toBe("https://ref3.com");
 		});
@@ -141,7 +143,7 @@ describe("groundTruthFromApi", () => {
 				],
 			});
 			const result = groundTruthFromApi(api);
-			const ref = result.references[0];
+			const ref = getItemReferences(result)[0];
 
 			expect(ref.url).toBe("https://example.com");
 			expect(ref.title).toBe("Example Title");
@@ -200,8 +202,8 @@ describe("groundTruthFromApi", () => {
 			});
 			const result = groundTruthFromApi(api);
 
-			expect(result.references).toHaveLength(1);
-			expect(result.references[0].messageIndex).toBe(1);
+			expect(getItemReferences(result)).toHaveLength(1);
+			expect(getItemReferences(result)[0].messageIndex).toBe(1);
 		});
 
 		it("creates empty agent turn when answer is empty", () => {
@@ -230,8 +232,8 @@ describe("groundTruthFromApi", () => {
 			});
 			const result = groundTruthFromApi(api);
 
-			expect(result.references).toHaveLength(1);
-			expect(result.references[0].messageIndex).toBeUndefined();
+			expect(getItemReferences(result)).toHaveLength(1);
+			expect(getItemReferences(result)[0].messageIndex).toBeUndefined();
 		});
 	});
 
@@ -303,6 +305,39 @@ describe("groundTruthFromApi", () => {
 			expect(result.computedTags).toEqual([]);
 		});
 	});
+
+	describe("tool call handling", () => {
+		it("normalizes null tool call arguments to undefined", () => {
+			const api = makeApiItem({
+				toolCalls: [
+					{
+						id: "tc-1",
+						name: "search_docs",
+						callType: "tool",
+						arguments: null,
+					},
+				],
+			});
+			const result = groundTruthFromApi(api);
+
+			expect(result.toolCalls).toHaveLength(1);
+			expect(result.toolCalls?.[0]).toMatchObject({
+				id: "tc-1",
+				name: "search_docs",
+				callType: "tool",
+			});
+			expect(result.toolCalls?.[0].arguments).toBeUndefined();
+		});
+	});
+
+	describe("contextEntries handling", () => {
+		it("preserves explicit empty contextEntries arrays", () => {
+			const api = makeApiItem({ contextEntries: [] });
+			const result = groundTruthFromApi(api);
+
+			expect(result.contextEntries).toEqual([]);
+		});
+	});
 });
 
 describe("groundTruthToPatch", () => {
@@ -318,7 +353,6 @@ describe("groundTruthToPatch", () => {
 			deleted: false,
 			tags: [],
 			manualTags: [],
-			references: [],
 			...overrides,
 		};
 	}
@@ -345,10 +379,22 @@ describe("groundTruthToPatch", () => {
 					{ role: "user", content: "Q" },
 					{ role: "agent", content: "A" },
 				],
-				references: [
-					{ id: "r1", url: "https://ref.com", messageIndex: 1 },
-					{ id: "r2", url: "https://user-ref.com", messageIndex: 0 }, // Should be ignored
-				],
+				plugins: {
+					"rag-compat": {
+						kind: "rag-compat",
+						version: "1.0",
+						data: {
+							retrievals: {
+								_unassociated: {
+									candidates: [
+										{ url: "https://ref.com", messageIndex: 1 },
+										{ url: "https://user-ref.com", messageIndex: 0 },
+									],
+								},
+							},
+						},
+					},
+				},
 			});
 			const patch = groundTruthToPatch({ item });
 
@@ -370,10 +416,22 @@ describe("groundTruthToPatch", () => {
 					{ role: "user", content: "Q" },
 					{ role: "agent", content: "A" },
 				],
-				references: [
-					{ id: "r1", url: "https://legacy.ref", messageIndex: 1 },
-					{ id: "r2", url: "https://new.ref", messageIndex: 1 },
-				],
+				plugins: {
+					"rag-compat": {
+						kind: "rag-compat",
+						version: "1.0",
+						data: {
+							retrievals: {
+								_unassociated: {
+									candidates: [
+										{ url: "https://legacy.ref", messageIndex: 1 },
+										{ url: "https://new.ref", messageIndex: 1 },
+									],
+								},
+							},
+						},
+					},
+				},
 			});
 			const patch = groundTruthToPatch({ item, originalApi });
 
@@ -383,6 +441,42 @@ describe("groundTruthToPatch", () => {
 			expect(patch.refs?.map((r) => r.url)).toContain("https://new.ref");
 		});
 
+		it("preserves top-level refs when legacy items use empty history arrays", () => {
+			const originalApi = makeApiItem({
+				history: [],
+				refs: [{ url: "https://legacy-empty.ref", bonus: false }],
+			});
+			const item = makeDomainItem({
+				history: [
+					{ role: "user", content: "Q" },
+					{ role: "agent", content: "A" },
+				],
+				plugins: {
+					"rag-compat": {
+						kind: "rag-compat",
+						version: "1.0",
+						data: {
+							retrievals: {
+								_unassociated: {
+									candidates: [
+										{ url: "https://legacy-empty.ref", messageIndex: 1 },
+										{ url: "https://new-empty.ref", messageIndex: 1 },
+									],
+								},
+							},
+						},
+					},
+				},
+			});
+			const patch = groundTruthToPatch({ item, originalApi });
+
+			expect(patch.refs).toHaveLength(2);
+			expect(patch.refs?.map((r) => r.url)).toContain(
+				"https://legacy-empty.ref",
+			);
+			expect(patch.refs?.map((r) => r.url)).toContain("https://new-empty.ref");
+		});
+
 		it("omits top-level refs for true multi-turn items", () => {
 			const originalApi = makeApiItem({
 				history: [
@@ -400,7 +494,19 @@ describe("groundTruthToPatch", () => {
 					{ role: "user", content: "Q" },
 					{ role: "agent", content: "A" },
 				],
-				references: [{ id: "r1", url: "https://turn.ref", messageIndex: 1 }],
+				plugins: {
+					"rag-compat": {
+						kind: "rag-compat",
+						version: "1.0",
+						data: {
+							retrievals: {
+								_unassociated: {
+									candidates: [{ url: "https://turn.ref", messageIndex: 1 }],
+								},
+							},
+						},
+					},
+				},
 			});
 			const patch = groundTruthToPatch({ item, originalApi });
 
@@ -417,17 +523,28 @@ describe("groundTruthToPatch", () => {
 					{ role: "user", content: "Q" },
 					{ role: "agent", content: "A" },
 				],
-				references: [
-					{
-						id: "r1",
-						url: "https://example.com",
-						title: "Title",
-						snippet: "Snippet",
-						keyParagraph: "Key",
-						bonus: true,
-						messageIndex: 1,
+				plugins: {
+					"rag-compat": {
+						kind: "rag-compat",
+						version: "1.0",
+						data: {
+							retrievals: {
+								_unassociated: {
+									candidates: [
+										{
+											url: "https://example.com",
+											title: "Title",
+											chunk: "Snippet",
+											keyParagraph: "Key",
+											bonus: true,
+											messageIndex: 1,
+										},
+									],
+								},
+							},
+						},
 					},
-				],
+				},
 			});
 			const patch = groundTruthToPatch({ item });
 			const ref = patch.history?.[1].refs?.[0];
@@ -529,4 +646,21 @@ describe("groundTruthToPatch", () => {
 			expect(patch.manualTags).toEqual(["tag1", "tag2"]);
 		});
 	});
+
+	describe("contextEntries serialization", () => {
+		it("includes explicit empty contextEntries arrays in the patch", () => {
+			const item = makeDomainItem({ contextEntries: [] });
+			const patch = groundTruthToPatch({ item });
+
+			expect(patch).toHaveProperty("contextEntries");
+			expect((patch as Record<string, unknown>).contextEntries).toEqual([]);
+		});
+
+		it("omits undefined contextEntries from the patch", () => {
+			const item = makeDomainItem({ contextEntries: undefined });
+			const patch = groundTruthToPatch({ item });
+
+			expect("contextEntries" in patch).toBe(false);
+		});
+	});
 });
diff --git a/frontend/tests/unit/adapters/apiProvider-etag.test.ts b/frontend/tests/unit/adapters/apiProvider-etag.test.ts
index 8e62a1a..4377739 100644
--- a/frontend/tests/unit/adapters/apiProvider-etag.test.ts
+++ b/frontend/tests/unit/adapters/apiProvider-etag.test.ts
@@ -3,7 +3,15 @@ import { ApiProvider } from "../../../src/adapters/apiProvider";
 import type { components } from "../../../src/api/generated";
 import type { GroundTruthItem } from "../../../src/models/groundTruth";
 
-type ApiItem = components["schemas"]["GroundTruthItem-Output"];
+type ApiItem = components["schemas"]["AgenticGroundTruthEntry-Output"] & {
+	synthQuestion?: string | null;
+	editedQuestion?: string | null;
+	answer?: string | null;
+	refs?: components["schemas"]["Reference"][];
+	totalReferences?: number;
+	tags?: string[];
+	comment?: string | null;
+};
 
 const {
 	mockGetMyAssignments,
diff --git a/frontend/tests/unit/components/app/CurateLayout.integration.test.tsx b/frontend/tests/unit/components/app/CurateLayout.integration.test.tsx
index d10755d..b61d784 100644
--- a/frontend/tests/unit/components/app/CurateLayout.integration.test.tsx
+++ b/frontend/tests/unit/components/app/CurateLayout.integration.test.tsx
@@ -2,7 +2,6 @@ import { fireEvent, render, screen } from "@testing-library/react";
 import { useState } from "react";
 import CuratePane from "../../../../src/components/app/pages/CuratePane";
 import QueueSidebar from "../../../../src/components/app/QueueSidebar";
-import type { AgentGenerationResult } from "../../../../src/hooks/useGroundTruth";
 import type { GroundTruthItem } from "../../../../src/models/groundTruth";
 
 // Minimal harness to exercise interactions between the sidebar and the editor
@@ -15,7 +14,6 @@ function MiniCurateApp() {
 			question: "Q-1",
 			answer: "",
 			history: [{ role: "user", content: "Q-1" }],
-			references: [],
 			status: "draft",
 			providerId: "json",
 			tags: [],
@@ -25,7 +23,6 @@ function MiniCurateApp() {
 			question: "Q-2",
 			answer: "",
 			history: [{ role: "user", content: "Q-2" }],
-			references: [],
 			status: "draft",
 			providerId: "json",
 			tags: [],
@@ -72,16 +69,7 @@ function MiniCurateApp() {
 						canApprove={true}
 						saving={false}
 						onDuplicate={() => void 0}
-						onUpdateQuestion={(q) => {
-							setItems((arr) =>
-								arr.map((i) =>
-									i.id === selectedId ? { ...i, question: q } : i,
-								),
-							);
-							setUnsaved(true);
-						}}
 						onUpdateComment={() => void 0}
-						onUpdateAnswer={() => void 0}
 						onUpdateTags={() => void 0}
 						onUpdateHistory={(history) => {
 							setItems((arr) =>
@@ -94,19 +82,11 @@ function MiniCurateApp() {
 							setUnsaved(true);
 						}}
 						onDeleteTurn={() => void 0}
-						onGenerateAgentTurn={async (): Promise<AgentGenerationResult> => ({
-							ok: true as const,
-							messageIndex: 0,
-						})}
 						onSaveDraft={() => void 0}
 						onApprove={() => void 0}
 						onSkip={() => void 0}
 						onDelete={() => void 0}
 						onRestore={() => void 0}
-						onUpdateReference={() => void 0}
-						onRemoveReference={() => void 0}
-						onOpenReference={() => void 0}
-						onAddReferences={() => void 0}
 					/>
 				</div>
 			) : (
diff --git a/frontend/tests/unit/components/app/QuestionsExplorer.test.tsx b/frontend/tests/unit/components/app/QuestionsExplorer.test.tsx
index 4f40761..2b292a7 100644
--- a/frontend/tests/unit/components/app/QuestionsExplorer.test.tsx
+++ b/frontend/tests/unit/components/app/QuestionsExplorer.test.tsx
@@ -1,16 +1,56 @@
-import { fireEvent, render, screen } from "@testing-library/react";
+import {
+	act,
+	fireEvent,
+	render,
+	screen,
+	waitFor,
+} from "@testing-library/react";
 import { beforeEach, describe, expect, it, vi } from "vitest";
 import QuestionsExplorer, {
 	type QuestionsExplorerItem,
 } from "../../../../src/components/app/QuestionsExplorer";
 
+const serviceMocks = vi.hoisted(() => ({
+	listAllGroundTruths: vi.fn(),
+	fetchTagsWithComputed: vi.fn(),
+	subscribeToTagMetadata: vi.fn(() => () => {}),
+	getTagMetadataSnapshot: vi.fn(() => ({
+		manualTags: [],
+		computedTags: [],
+		allTags: [],
+		loading: false,
+		error: null,
+	})),
+	fetchAvailableDatasets: vi.fn(),
+}));
+
+vi.mock("../../../../src/services/groundTruths", async () => {
+	const actual = await vi.importActual<
+		typeof import("../../../../src/services/groundTruths")
+	>("../../../../src/services/groundTruths");
+
+	return {
+		...actual,
+		listAllGroundTruths: serviceMocks.listAllGroundTruths,
+	};
+});
+
+vi.mock("../../../../src/services/tags", () => ({
+	fetchTagsWithComputed: serviceMocks.fetchTagsWithComputed,
+	subscribeToTagMetadata: serviceMocks.subscribeToTagMetadata,
+	getTagMetadataSnapshot: serviceMocks.getTagMetadataSnapshot,
+}));
+
+vi.mock("../../../../src/services/datasets", () => ({
+	fetchAvailableDatasets: serviceMocks.fetchAvailableDatasets,
+}));
+
 const createMockItem = (
 	overrides: Partial<QuestionsExplorerItem> = {},
 ): QuestionsExplorerItem => ({
 	id: "item-1",
 	question: "Test Question",
 	answer: "Test Answer",
-	references: [],
 	status: "draft",
 	providerId: "test",
 	...overrides,
@@ -28,13 +68,78 @@ describe("QuestionsExplorer", () => {
 		onDelete: mockOnDelete,
 	};
 
+	const renderQuestionsExplorer = async (
+		props: Partial<{
+			items: QuestionsExplorerItem[] | undefined;
+			onAssign: typeof mockOnAssign;
+			onInspect: typeof mockOnInspect;
+			onDelete: typeof mockOnDelete;
+		}> = {},
+	) => {
+		render(<QuestionsExplorer {...defaultProps} {...props} />);
+		await waitFor(() => {
+			expect(serviceMocks.fetchTagsWithComputed).toHaveBeenCalledTimes(1);
+			expect(serviceMocks.fetchAvailableDatasets).toHaveBeenCalledTimes(1);
+		});
+	};
+
 	beforeEach(() => {
 		mockOnAssign.mockClear();
 		mockOnInspect.mockClear();
 		mockOnDelete.mockClear();
+		window.history.replaceState({}, "", "/");
+		serviceMocks.listAllGroundTruths.mockReset();
+		serviceMocks.fetchTagsWithComputed.mockReset();
+		serviceMocks.subscribeToTagMetadata.mockClear();
+		serviceMocks.getTagMetadataSnapshot.mockClear();
+		serviceMocks.fetchAvailableDatasets.mockReset();
+		serviceMocks.fetchTagsWithComputed.mockResolvedValue({
+			manualTags: [],
+			computedTags: [],
+		});
+		serviceMocks.getTagMetadataSnapshot.mockReturnValue({
+			manualTags: [],
+			computedTags: [],
+			allTags: [],
+			loading: false,
+			error: null,
+		});
+		serviceMocks.fetchAvailableDatasets.mockResolvedValue([]);
+		serviceMocks.listAllGroundTruths.mockImplementation(
+			async (
+				params: {
+					itemId?: string;
+					refUrl?: string;
+					keyword?: string;
+					page?: number;
+				} = {},
+			) => {
+				const hasTextFilter =
+					Boolean(params.itemId) ||
+					Boolean(params.refUrl) ||
+					Boolean(params.keyword);
+				const page = typeof params.page === "number" ? params.page : 1;
+				const totalPages = hasTextFilter ? 1 : 3;
+
+				return {
+					items: [
+						createMockItem({
+							id: hasTextFilter ? "filtered-item" : `page-${page}-item`,
+							question: hasTextFilter
+								? "Filtered Question"
+								: `Question from page ${page}`,
+						}),
+					],
+					pagination: {
+						total: hasTextFilter ? 1 : 75,
+						totalPages,
+					},
+				};
+			},
+		);
 	});
 
-	it("should render all items when no filter is active", () => {
+	it("should render all items when no filter is active", async () => {
 		const items: QuestionsExplorerItem[] = [
 			createMockItem({ id: "1", status: "draft", question: "Draft Q" }),
 			createMockItem({ id: "2", status: "approved", question: "Approved Q" }),
@@ -46,19 +151,19 @@ describe("QuestionsExplorer", () => {
 			}),
 		];
 
-		render(<QuestionsExplorer {...defaultProps} items={items} />);
+		await renderQuestionsExplorer({ items });
 
 		expect(screen.getByText("Draft Q")).toBeInTheDocument();
 		expect(screen.getByText("Approved Q")).toBeInTheDocument();
 		expect(screen.getByText("Deleted Q")).toBeInTheDocument();
 	});
 
-	it("should call onAssign when Assign button clicked", () => {
+	it("should call onAssign when Assign button clicked", async () => {
 		const items: QuestionsExplorerItem[] = [
 			createMockItem({ id: "test-123", question: "Test Q" }),
 		];
 
-		render(<QuestionsExplorer {...defaultProps} items={items} />);
+		await renderQuestionsExplorer({ items });
 
 		const assignButton = screen.getByRole("button", {
 			name: "Assign test-123",
@@ -69,12 +174,12 @@ describe("QuestionsExplorer", () => {
 		expect(mockOnAssign).toHaveBeenCalledTimes(1);
 	});
 
-	it("should call onInspect when Inspect button clicked", () => {
+	it("should call onInspect when Inspect button clicked", async () => {
 		const items: QuestionsExplorerItem[] = [
 			createMockItem({ id: "test-456", question: "Test Q" }),
 		];
 
-		render(<QuestionsExplorer {...defaultProps} items={items} />);
+		await renderQuestionsExplorer({ items });
 
 		const inspectButton = screen.getByRole("button", {
 			name: "Inspect test-456",
@@ -85,45 +190,47 @@ describe("QuestionsExplorer", () => {
 		expect(mockOnInspect).toHaveBeenCalledTimes(1);
 	});
 
-	it("should call onDelete when Delete button clicked", () => {
+	it("should call onDelete when Delete button clicked", async () => {
 		const items: QuestionsExplorerItem[] = [
 			createMockItem({ id: "test-789", question: "Test Q" }),
 		];
 
-		render(<QuestionsExplorer {...defaultProps} items={items} />);
+		await renderQuestionsExplorer({ items });
 
 		const deleteButton = screen.getByRole("button", {
 			name: "Delete test-789",
 		});
-		fireEvent.click(deleteButton);
+		await act(async () => {
+			fireEvent.click(deleteButton);
+		});
 
 		// onDelete is called with the full item object
 		expect(mockOnDelete).toHaveBeenCalledWith(items[0]);
 		expect(mockOnDelete).toHaveBeenCalledTimes(1);
 	});
 
-	it("should show item count correctly", () => {
+	it("should show item count correctly", async () => {
 		const items: QuestionsExplorerItem[] = [
 			createMockItem({ id: "1", question: "Q1" }),
 			createMockItem({ id: "2", question: "Q2" }),
 			createMockItem({ id: "3", question: "Q3" }),
 		];
 
-		render(<QuestionsExplorer {...defaultProps} items={items} />);
+		await renderQuestionsExplorer({ items });
 
 		expect(screen.getByText("Showing 3 of 3 items")).toBeInTheDocument();
 	});
 
 	// Pagination Tests
 	describe("Pagination", () => {
-		it("should not show pagination controls when items fit in one page", () => {
+		it("should not show pagination controls when items fit in one page", async () => {
 			const items: QuestionsExplorerItem[] = Array.from(
 				{ length: 10 },
 				(_, i) =>
 					createMockItem({ id: `item-${i}`, question: `Question ${i}` }),
 			);
 
-			render(<QuestionsExplorer {...defaultProps} items={items} />);
+			await renderQuestionsExplorer({ items });
 
 			expect(
 				screen.queryByRole("button", { name: "Previous" }),
@@ -133,14 +240,14 @@ describe("QuestionsExplorer", () => {
 			).not.toBeInTheDocument();
 		});
 
-		it("should show pagination controls when items exceed one page", () => {
+		it("should show pagination controls when items exceed one page", async () => {
 			const items: QuestionsExplorerItem[] = Array.from(
 				{ length: 30 },
 				(_, i) =>
 					createMockItem({ id: `item-${i}`, question: `Question ${i}` }),
 			);
 
-			render(<QuestionsExplorer {...defaultProps} items={items} />);
+			await renderQuestionsExplorer({ items });
 
 			expect(
 				screen.getByRole("button", { name: "Previous" }),
@@ -149,28 +256,28 @@ describe("QuestionsExplorer", () => {
 			expect(screen.getByText("Page 1 of 2")).toBeInTheDocument();
 		});
 
-		it("should display correct number of items per page", () => {
+		it("should display correct number of items per page", async () => {
 			const items: QuestionsExplorerItem[] = Array.from(
 				{ length: 30 },
 				(_, i) =>
 					createMockItem({ id: `item-${i}`, question: `Question ${i}` }),
 			);
 
-			render(<QuestionsExplorer {...defaultProps} items={items} />);
+			await renderQuestionsExplorer({ items });
 
 			// With items prop, all items are displayed regardless of pagination settings
 			// The count shows all 30 items
 			expect(screen.getByText("Showing 30 of 30 items")).toBeInTheDocument();
 		});
 
-		it("should navigate to next page", () => {
+		it("should navigate to next page", async () => {
 			const items: QuestionsExplorerItem[] = Array.from(
 				{ length: 30 },
 				(_, i) =>
 					createMockItem({ id: `item-${i}`, question: `Question ${i}` }),
 			);
 
-			render(<QuestionsExplorer {...defaultProps} items={items} />);
+			await renderQuestionsExplorer({ items });
 
 			const nextButton = screen.getByRole("button", { name: "Next" });
 			fireEvent.click(nextButton);
@@ -181,14 +288,14 @@ describe("QuestionsExplorer", () => {
 			expect(screen.getByText("Showing 30 of 30 items")).toBeInTheDocument();
 		});
 
-		it("should navigate to previous page", () => {
+		it("should navigate to previous page", async () => {
 			const items: QuestionsExplorerItem[] = Array.from(
 				{ length: 30 },
 				(_, i) =>
 					createMockItem({ id: `item-${i}`, question: `Question ${i}` }),
 			);
 
-			render(<QuestionsExplorer {...defaultProps} items={items} />);
+			await renderQuestionsExplorer({ items });
 
 			const nextButton = screen.getByRole("button", { name: "Next" });
 			fireEvent.click(nextButton);
@@ -202,27 +309,27 @@ describe("QuestionsExplorer", () => {
 			expect(screen.getByText("Showing 30 of 30 items")).toBeInTheDocument();
 		});
 
-		it("should disable Previous button on first page", () => {
+		it("should disable Previous button on first page", async () => {
 			const items: QuestionsExplorerItem[] = Array.from(
 				{ length: 30 },
 				(_, i) =>
 					createMockItem({ id: `item-${i}`, question: `Question ${i}` }),
 			);
 
-			render(<QuestionsExplorer {...defaultProps} items={items} />);
+			await renderQuestionsExplorer({ items });
 
 			const previousButton = screen.getByRole("button", { name: "Previous" });
 			expect(previousButton).toBeDisabled();
 		});
 
-		it("should disable Next button on last page", () => {
+		it("should disable Next button on last page", async () => {
 			const items: QuestionsExplorerItem[] = Array.from(
 				{ length: 30 },
 				(_, i) =>
 					createMockItem({ id: `item-${i}`, question: `Question ${i}` }),
 			);
 
-			render(<QuestionsExplorer {...defaultProps} items={items} />);
+			await renderQuestionsExplorer({ items });
 
 			const nextButton = screen.getByRole("button", { name: "Next" });
 			fireEvent.click(nextButton);
@@ -231,14 +338,14 @@ describe("QuestionsExplorer", () => {
 			expect(nextButton).toBeDisabled();
 		});
 
-		it("should navigate to specific page by clicking page number", () => {
+		it("should navigate to specific page by clicking page number", async () => {
 			const items: QuestionsExplorerItem[] = Array.from(
 				{ length: 60 },
 				(_, i) =>
 					createMockItem({ id: `item-${i}`, question: `Question ${i}` }),
 			);
 
-			render(<QuestionsExplorer {...defaultProps} items={items} />);
+			await renderQuestionsExplorer({ items });
 
 			// Should have 3 pages (60 items / 25 per page = 2.4, rounded up to 3)
 			const page2Button = screen.getByRole("button", { name: "2" });
@@ -247,14 +354,14 @@ describe("QuestionsExplorer", () => {
 			expect(screen.getByText("Page 2 of 3")).toBeInTheDocument();
 		});
 
-		it("should highlight current page number", () => {
+		it("should highlight current page number", async () => {
 			const items: QuestionsExplorerItem[] = Array.from(
 				{ length: 60 },
 				(_, i) =>
 					createMockItem({ id: `item-${i}`, question: `Question ${i}` }),
 			);
 
-			render(<QuestionsExplorer {...defaultProps} items={items} />);
+			await renderQuestionsExplorer({ items });
 
 			// Page 1 button should have active styling (bg-blue-500)
 			const page1Button = screen.getByRole("button", { name: "1" });
@@ -265,14 +372,14 @@ describe("QuestionsExplorer", () => {
 			expect(page2Button.className).not.toContain("bg-blue-500");
 		});
 
-		it("should reset to page 1 when changing items per page from a different page", () => {
+		it("should reset to page 1 when changing items per page from a different page", async () => {
 			const items: QuestionsExplorerItem[] = Array.from(
 				{ length: 60 },
 				(_, i) =>
 					createMockItem({ id: `item-${i}`, question: `Question ${i}` }),
 			);
 
-			render(<QuestionsExplorer {...defaultProps} items={items} />);
+			await renderQuestionsExplorer({ items });
 
 			// Go to page 2
 			const nextButton = screen.getByRole("button", { name: "Next" });
@@ -286,5 +393,62 @@ describe("QuestionsExplorer", () => {
 			// Should reset to page 1
 			expect(screen.getByText("Page 1 of 2")).toBeInTheDocument();
 		});
+
+		it.each([
+			{
+				name: "item ID",
+				label: "Item ID:",
+				value: "item-42",
+				expectedFilter: { itemId: "item-42" },
+			},
+			{
+				name: "reference URL",
+				label: "Reference URL:",
+				value: "https://example.com/ref",
+				expectedFilter: { refUrl: "https://example.com/ref" },
+			},
+			{
+				name: "keyword",
+				label: "Keyword Search:",
+				value: "agentic",
+				expectedFilter: { keyword: "agentic" },
+			},
+		])("should reset to page 1 when applying a $name filter from page 2", async ({
+			label,
+			value,
+			expectedFilter,
+		}) => {
+			await renderQuestionsExplorer({ items: undefined });
+
+			await waitFor(() => {
+				expect(screen.getByText("Page 1 of 3")).toBeInTheDocument();
+			});
+
+			fireEvent.click(screen.getByRole("button", { name: "2" }));
+
+			await waitFor(() => {
+				expect(screen.getByText("Page 2 of 3")).toBeInTheDocument();
+			});
+
+			fireEvent.change(screen.getByLabelText(label), {
+				target: { value },
+			});
+			fireEvent.click(screen.getByRole("button", { name: "Apply Filters" }));
+
+			await waitFor(() => {
+				expect(serviceMocks.listAllGroundTruths).toHaveBeenLastCalledWith(
+					expect.objectContaining({
+						page: 1,
+						...expectedFilter,
+					}),
+					expect.any(AbortSignal),
+				);
+			});
+
+			await waitFor(() => {
+				expect(screen.getByText("Filtered Question")).toBeInTheDocument();
+				expect(screen.getByText("Showing 1 of 1 items")).toBeInTheDocument();
+			});
+		});
 	});
 });
diff --git a/frontend/tests/unit/components/app/ReferencesTabs.multiturn.test.tsx b/frontend/tests/unit/components/app/ReferencesTabs.multiturn.test.tsx
index fb6908b..9657b1b 100644
--- a/frontend/tests/unit/components/app/ReferencesTabs.multiturn.test.tsx
+++ b/frontend/tests/unit/components/app/ReferencesTabs.multiturn.test.tsx
@@ -1,4 +1,4 @@
-import { render, screen } from "@testing-library/react";
+import { render, screen, waitFor } from "@testing-library/react";
 import type { RefObject } from "react";
 import type { RightTab } from "../../../../src/components/app/ReferencesPanel/ReferencesTabs";
 import ReferencesTabs from "../../../../src/components/app/ReferencesPanel/ReferencesTabs";
@@ -26,15 +26,37 @@ describe("ReferencesTabs multi-turn gating", () => {
 		isMultiTurn: false,
 	});
 
-	it("forces selected tab and hides search when multi-turn is active", () => {
+	it("forces selected tab and hides search when multi-turn is active", async () => {
 		const props = makeProps();
 		props.rightTab = "search";
 		props.isMultiTurn = true;
 		render(<ReferencesTabs {...props} />);
-		expect(props.setRightTab).toHaveBeenCalledWith("selected");
+		await waitFor(() => {
+			expect(props.setRightTab).toHaveBeenCalledWith("selected");
+		});
 		expect(screen.queryByRole("button", { name: /search/i })).toBeNull();
 	});
 
+	it("does not update parent state during render in multi-turn mode", () => {
+		const props = makeProps();
+		props.rightTab = "search";
+		props.isMultiTurn = true;
+		const consoleError = vi
+			.spyOn(console, "error")
+			.mockImplementation(() => undefined);
+		render(<ReferencesTabs {...props} />);
+		expect(
+			consoleError.mock.calls.some((call) =>
+				call.some(
+					(value) =>
+						typeof value === "string" &&
+						value.includes("Cannot update a component while rendering"),
+				),
+			),
+		).toBe(false);
+		consoleError.mockRestore();
+	});
+
 	it("renders guidance banner in multi-turn mode", () => {
 		const props = makeProps();
 		props.isMultiTurn = true;
diff --git a/frontend/tests/unit/components/app/conversation-turn-add-reference.test.tsx b/frontend/tests/unit/components/app/conversation-turn-add-reference.test.tsx
deleted file mode 100644
index 5a967a5..0000000
--- a/frontend/tests/unit/components/app/conversation-turn-add-reference.test.tsx
+++ /dev/null
@@ -1,158 +0,0 @@
-import { fireEvent, render, screen } from "@testing-library/react";
-import { describe, expect, it, vi } from "vitest";
-import ConversationTurn from "../../../../src/components/app/editor/ConversationTurn";
-import type { ConversationTurn as ConversationTurnType } from "../../../../src/models/groundTruth";
-
-describe("ConversationTurn - Add Reference Button", () => {
-	const mockAgentTurn: ConversationTurnType = {
-		role: "agent",
-		content: "This is an agent response",
-		expectedBehavior: ["generation:answer"],
-	};
-
-	const mockUserTurn: ConversationTurnType = {
-		role: "user",
-		content: "This is a user question",
-	};
-
-	it("renders Add reference button for agent turn with no references", () => {
-		const mockOnViewReferences = vi.fn();
-
-		render(
-			<ConversationTurn
-				turn={mockAgentTurn}
-				index={1}
-				isLast={false}
-				onUpdate={vi.fn()}
-				onUpdateExpectedBehavior={vi.fn()}
-				onDelete={vi.fn()}
-				onRegenerate={vi.fn()}
-				canEdit={true}
-				isGenerating={false}
-				referenceCount={0}
-				onViewReferences={mockOnViewReferences}
-			/>,
-		);
-
-		const addButton = screen.getByRole("button", { name: /add reference/i });
-		expect(addButton).toBeInTheDocument();
-		expect(addButton).toHaveTextContent("Add reference");
-	});
-
-	it("does not render Add reference button for user turns", () => {
-		render(
-			<ConversationTurn
-				turn={mockUserTurn}
-				index={0}
-				isLast={false}
-				onUpdate={vi.fn()}
-				onDelete={vi.fn()}
-				canEdit={true}
-				isGenerating={false}
-				referenceCount={0}
-				onViewReferences={vi.fn()}
-			/>,
-		);
-
-		const addButton = screen.queryByRole("button", { name: /add reference/i });
-		expect(addButton).not.toBeInTheDocument();
-	});
-
-	it("does not render Add reference button when references exist", () => {
-		const mockOnViewReferences = vi.fn();
-
-		render(
-			<ConversationTurn
-				turn={mockAgentTurn}
-				index={1}
-				isLast={false}
-				onUpdate={vi.fn()}
-				onUpdateExpectedBehavior={vi.fn()}
-				onDelete={vi.fn()}
-				onRegenerate={vi.fn()}
-				canEdit={true}
-				isGenerating={false}
-				referenceCount={2}
-				onViewReferences={mockOnViewReferences}
-			/>,
-		);
-
-		const addButton = screen.queryByRole("button", { name: /add reference/i });
-		expect(addButton).not.toBeInTheDocument();
-
-		// Should show the count button instead
-		const countButton = screen.getByRole("button", { name: /2 references/i });
-		expect(countButton).toBeInTheDocument();
-	});
-
-	it("calls onViewReferences when Add reference clicked", () => {
-		const mockOnViewReferences = vi.fn();
-
-		render(
-			<ConversationTurn
-				turn={mockAgentTurn}
-				index={1}
-				isLast={false}
-				onUpdate={vi.fn()}
-				onUpdateExpectedBehavior={vi.fn()}
-				onDelete={vi.fn()}
-				onRegenerate={vi.fn()}
-				canEdit={true}
-				isGenerating={false}
-				referenceCount={0}
-				onViewReferences={mockOnViewReferences}
-			/>,
-		);
-
-		const addButton = screen.getByRole("button", { name: /add reference/i });
-		fireEvent.click(addButton);
-
-		expect(mockOnViewReferences).toHaveBeenCalledTimes(1);
-	});
-
-	it("Add reference button uses same styling as count button", () => {
-		const mockOnViewReferences = vi.fn();
-
-		const { rerender } = render(
-			<ConversationTurn
-				turn={mockAgentTurn}
-				index={1}
-				isLast={false}
-				onUpdate={vi.fn()}
-				onUpdateExpectedBehavior={vi.fn()}
-				onDelete={vi.fn()}
-				onRegenerate={vi.fn()}
-				canEdit={true}
-				isGenerating={false}
-				referenceCount={0}
-				onViewReferences={mockOnViewReferences}
-			/>,
-		);
-
-		const addButton = screen.getByRole("button", { name: /add reference/i });
-		const addButtonClasses = addButton.className;
-
-		// Rerender with references to get the count button
-		rerender(
-			<ConversationTurn
-				turn={mockAgentTurn}
-				index={1}
-				isLast={false}
-				onUpdate={vi.fn()}
-				onUpdateExpectedBehavior={vi.fn()}
-				onDelete={vi.fn()}
-				onRegenerate={vi.fn()}
-				canEdit={true}
-				isGenerating={false}
-				referenceCount={1}
-				onViewReferences={mockOnViewReferences}
-			/>,
-		);
-
-		const countButton = screen.getByRole("button", { name: /1 reference$/i });
-		const countButtonClasses = countButton.className;
-
-		// Both buttons should have the same styling classes
-		expect(addButtonClasses).toBe(countButtonClasses);
-	});
-});
diff --git a/frontend/tests/unit/components/app/editors/ToolNecessityEditor.test.tsx b/frontend/tests/unit/components/app/editors/ToolNecessityEditor.test.tsx
new file mode 100644
index 0000000..5e8fc88
--- /dev/null
+++ b/frontend/tests/unit/components/app/editors/ToolNecessityEditor.test.tsx
@@ -0,0 +1,178 @@
+import { cleanup, fireEvent, render, screen } from "@testing-library/react";
+import { afterEach, describe, expect, it, vi } from "vitest";
+import ToolNecessityEditor from "../../../../../src/components/app/editors/ToolNecessityEditor";
+import type {
+	ExpectedTools,
+	ToolCallRecord,
+} from "../../../../../src/models/groundTruth";
+
+afterEach(cleanup);
+
+const toolCalls: ToolCallRecord[] = [
+	{ id: "1", name: "search", callType: "tool" },
+	{ id: "2", name: "lookup", callType: "tool" },
+];
+
+describe("ToolNecessityEditor", () => {
+	it("renders a row for each unique tool name", () => {
+		const onUpdate = vi.fn();
+		render(
+			<ToolNecessityEditor
+				toolCalls={toolCalls}
+				expectedTools={undefined}
+				onUpdate={onUpdate}
+			/>,
+		);
+		expect(screen.getByText("search")).toBeInTheDocument();
+		expect(screen.getByText("lookup")).toBeInTheDocument();
+	});
+
+	it("includes tool names from expectedTools that are not in toolCalls", () => {
+		const onUpdate = vi.fn();
+		const expected: ExpectedTools = {
+			required: [{ name: "summarize" }],
+		};
+		render(
+			<ToolNecessityEditor
+				toolCalls={toolCalls}
+				expectedTools={expected}
+				onUpdate={onUpdate}
+			/>,
+		);
+		expect(screen.getByText("summarize")).toBeInTheDocument();
+	});
+
+	it("shows correct active state for a required tool", () => {
+		const onUpdate = vi.fn();
+		const expected: ExpectedTools = {
+			required: [{ name: "search" }],
+		};
+		render(
+			<ToolNecessityEditor
+				toolCalls={toolCalls}
+				expectedTools={expected}
+				onUpdate={onUpdate}
+			/>,
+		);
+		const requiredBtn = screen.getByRole("button", {
+			name: "Set search to Required",
+		});
+		expect(requiredBtn).toHaveAttribute("aria-pressed", "true");
+	});
+
+	it("toggles tool from required to optional", () => {
+		const onUpdate = vi.fn();
+		const expected: ExpectedTools = {
+			required: [{ name: "search" }],
+			optional: [{ name: "lookup" }],
+		};
+		render(
+			<ToolNecessityEditor
+				toolCalls={toolCalls}
+				expectedTools={expected}
+				onUpdate={onUpdate}
+			/>,
+		);
+		const optionalBtn = screen.getByRole("button", {
+			name: "Set search to Optional",
+		});
+		fireEvent.click(optionalBtn);
+
+		expect(onUpdate).toHaveBeenCalledOnce();
+		const result = onUpdate.mock.calls[0][0] as ExpectedTools;
+		expect(result.required?.map((t) => t.name) ?? []).not.toContain("search");
+		expect(result.optional?.map((t) => t.name) ?? []).toContain("search");
+	});
+
+	it("toggles tool from optional to not-needed", () => {
+		const onUpdate = vi.fn();
+		const expected: ExpectedTools = {
+			optional: [{ name: "search" }],
+		};
+		render(
+			<ToolNecessityEditor
+				toolCalls={toolCalls}
+				expectedTools={expected}
+				onUpdate={onUpdate}
+			/>,
+		);
+		const notNeededBtn = screen.getByRole("button", {
+			name: "Set search to Not needed",
+		});
+		fireEvent.click(notNeededBtn);
+
+		expect(onUpdate).toHaveBeenCalledOnce();
+		const result = onUpdate.mock.calls[0][0] as ExpectedTools;
+		expect(result.optional?.map((t) => t.name) ?? []).not.toContain("search");
+		expect(result.notNeeded?.map((t) => t.name) ?? []).toContain("search");
+	});
+
+	it("preserves expectation arguments when moving a tool between buckets", () => {
+		const onUpdate = vi.fn();
+		const expected: ExpectedTools = {
+			optional: [
+				{
+					name: "search",
+					arguments: { query: "ground truth curator" },
+				},
+			],
+		};
+		render(
+			<ToolNecessityEditor
+				toolCalls={toolCalls}
+				expectedTools={expected}
+				onUpdate={onUpdate}
+			/>,
+		);
+		fireEvent.click(
+			screen.getByRole("button", {
+				name: "Set search to Required",
+			}),
+		);
+
+		expect(onUpdate).toHaveBeenCalledOnce();
+		const result = onUpdate.mock.calls[0][0] as ExpectedTools;
+		expect(result.optional?.map((t) => t.name) ?? []).not.toContain("search");
+		expect(result.required).toEqual([
+			{
+				name: "search",
+				arguments: { query: "ground truth curator" },
+			},
+		]);
+	});
+
+	it("shows empty state when no tool calls exist", () => {
+		const onUpdate = vi.fn();
+		render(
+			<ToolNecessityEditor
+				toolCalls={[]}
+				expectedTools={undefined}
+				onUpdate={onUpdate}
+			/>,
+		);
+		expect(screen.getByText("No tool calls to classify.")).toBeInTheDocument();
+	});
+
+	it("preserves other tools when toggling one tool", () => {
+		const onUpdate = vi.fn();
+		const expected: ExpectedTools = {
+			required: [{ name: "search" }, { name: "lookup" }],
+		};
+		render(
+			<ToolNecessityEditor
+				toolCalls={toolCalls}
+				expectedTools={expected}
+				onUpdate={onUpdate}
+			/>,
+		);
+		const optionalBtn = screen.getByRole("button", {
+			name: "Set search to Optional",
+		});
+		fireEvent.click(optionalBtn);
+
+		const result = onUpdate.mock.calls[0][0] as ExpectedTools;
+		expect(result.required?.map((t) => t.name) ?? []).toContain("lookup");
+		expect(result.required?.map((t) => t.name) ?? []).not.toContain("search");
+		expect(result.optional?.map((t) => t.name) ?? []).toContain("search");
+	});
+});
diff --git a/frontend/tests/unit/components/app/pages/CuratePane.test.tsx b/frontend/tests/unit/components/app/pages/CuratePane.test.tsx
index 09c2015..9498d62 100644
--- a/frontend/tests/unit/components/app/pages/CuratePane.test.tsx
+++ b/frontend/tests/unit/components/app/pages/CuratePane.test.tsx
@@ -1,13 +1,11 @@
 import { render, screen } from "@testing-library/react";
 import CuratePane from "../../../../../src/components/app/pages/CuratePane";
-import type { AgentGenerationResult } from "../../../../../src/hooks/useGroundTruth";
 import type { GroundTruthItem } from "../../../../../src/models/groundTruth";
 
 const item: GroundTruthItem = {
 	id: "1",
 	question: "What is this software?",
 	answer: "",
-	references: [],
 	status: "draft",
 	providerId: "json",
 	tags: [],
@@ -21,24 +19,15 @@ describe("CuratePane", () => {
 				canApprove={true}
 				saving={false}
 				onDuplicate={vi.fn()}
-				onUpdateQuestion={vi.fn()}
-				onUpdateAnswer={vi.fn()}
 				onUpdateComment={vi.fn()}
 				onUpdateTags={vi.fn()}
 				onUpdateHistory={vi.fn()}
 				onDeleteTurn={vi.fn()}
-				onGenerateAgentTurn={async (): Promise<AgentGenerationResult> => ({
-					ok: true as const,
-					messageIndex: 0,
-				})}
 				onSaveDraft={vi.fn()}
 				onApprove={vi.fn()}
 				onSkip={vi.fn()}
 				onDelete={vi.fn()}
 				onRestore={vi.fn()}
-				onUpdateReference={vi.fn()}
-				onRemoveReference={vi.fn()}
-				onOpenReference={vi.fn()}
 			/>,
 		);
 
@@ -60,24 +49,15 @@ describe("CuratePane", () => {
 				canApprove={true}
 				saving={false}
 				onDuplicate={vi.fn()}
-				onUpdateQuestion={vi.fn()}
-				onUpdateAnswer={vi.fn()}
 				onUpdateComment={vi.fn()}
 				onUpdateTags={vi.fn()}
 				onUpdateHistory={vi.fn()}
 				onDeleteTurn={vi.fn()}
-				onGenerateAgentTurn={async (): Promise<AgentGenerationResult> => ({
-					ok: true as const,
-					messageIndex: 0,
-				})}
 				onSaveDraft={vi.fn()}
 				onApprove={vi.fn()}
 				onSkip={vi.fn()}
 				onDelete={vi.fn()}
 				onRestore={vi.fn()}
-				onUpdateReference={vi.fn()}
-				onRemoveReference={vi.fn()}
-				onOpenReference={vi.fn()}
 			/>,
 		);
 
@@ -87,4 +67,39 @@ describe("CuratePane", () => {
 		expect(screen.queryByLabelText("Question")).not.toBeInTheDocument();
 		expect(screen.queryByLabelText("Answer")).not.toBeInTheDocument();
 	});
+
+	it("allows adding any turn type after a preserved non-user role", () => {
+		const itemWithHistory: GroundTruthItem = {
+			...item,
+			history: [
+				{ role: "user", content: "What is this software?" },
+				{ role: "output-agent", content: "It is a CAD software." },
+			],
+		};
+
+		render(
+			<CuratePane
+				current={itemWithHistory}
+				canApprove={true}
+				saving={false}
+				onDuplicate={vi.fn()}
+				onUpdateComment={vi.fn()}
+				onUpdateTags={vi.fn()}
+				onUpdateHistory={vi.fn()}
+				onDeleteTurn={vi.fn()}
+				onSaveDraft={vi.fn()}
+				onApprove={vi.fn()}
+				onSkip={vi.fn()}
+				onDelete={vi.fn()}
+				onRestore={vi.fn()}
+			/>,
+		);
+
+		expect(
+			screen.getByRole("button", { name: /Add User Turn/i }),
+		).toBeEnabled();
+		expect(
+			screen.getByRole("button", { name: /Add Agent Turn/i }),
+		).toBeEnabled();
+	});
 });
diff --git a/frontend/tests/unit/components/app/pages/QuestionsList.test.tsx b/frontend/tests/unit/components/app/pages/QuestionsList.test.tsx
index eceaefb..36d7247 100644
--- a/frontend/tests/unit/components/app/pages/QuestionsList.test.tsx
+++ b/frontend/tests/unit/components/app/pages/QuestionsList.test.tsx
@@ -10,7 +10,6 @@ const mkItem = (id: string, deleted = false): GroundTruthItem => ({
 	status: "draft",
 	providerId: "json",
 	deleted,
-	references: [],
 });
 
 describe("QuestionsList", () => {
diff --git a/frontend/tests/unit/components/app/pages/ReferencesSection.test.tsx b/frontend/tests/unit/components/app/pages/ReferencesSection.test.tsx
index de59a44..87a07dd 100644
--- a/frontend/tests/unit/components/app/pages/ReferencesSection.test.tsx
+++ b/frontend/tests/unit/components/app/pages/ReferencesSection.test.tsx
@@ -1,4 +1,4 @@
-import { fireEvent, render, screen } from "@testing-library/react";
+import { act, fireEvent, render, screen } from "@testing-library/react";
 import ReferencesSection from "../../../../../src/components/app/pages/ReferencesSection";
 import type { Reference } from "../../../../../src/models/groundTruth";
 
@@ -41,7 +41,9 @@ describe("ReferencesSection", () => {
 		expect(onAddRefs).toHaveBeenCalled();
 
 		// Run search
-		fireEvent.click(screen.getAllByRole("button", { name: /Search/i })[1]);
+		await act(async () => {
+			fireEvent.click(screen.getAllByRole("button", { name: /Search/i })[1]);
+		});
 		expect(onRunSearch).toHaveBeenCalled();
 	});
 
@@ -71,3 +73,134 @@ describe("ReferencesSection", () => {
 		expect(onRemoveReference).toHaveBeenCalled();
 	});
 });
+
+// ---------------------------------------------------------------------------
+// Phase 4: ReferencesSection as generic right pane
+// ---------------------------------------------------------------------------
+import type { GroundTruthItem } from "../../../../../src/models/groundTruth";
+import { getItemReferences } from "../../../../../src/models/groundTruth";
+
+const makeItem = (
+	overrides: Partial<GroundTruthItem> = {},
+): GroundTruthItem => ({
+	id: "i1",
+	question: "Q",
+	answer: "A",
+	status: "draft",
+	providerId: "test",
+	...overrides,
+});
+
+describe("ReferencesSection – generic right pane (Phase 4)", () => {
+	const noopProps = {
+		query: "",
+		setQuery: vi.fn(),
+		searching: false,
+		searchResults: [],
+		onRunSearch: vi.fn(),
+		onAddRefs: vi.fn(),
+		references: [],
+		onUpdateReference: vi.fn(),
+		onRemoveReference: vi.fn(),
+		onOpenReference: vi.fn(),
+	};
+
+	it("shows TracePanel when item has toolCalls", () => {
+		const item = makeItem({
+			toolCalls: [{ id: "tc1", name: "search", callType: "tool" }],
+		});
+		render(<ReferencesSection {...noopProps} item={item} isMultiTurn />);
+		expect(screen.getByText(/Evidence & Review/i)).toBeInTheDocument();
+	});
+
+	it("shows RAG compat panel when in single-turn mode", () => {
+		render(
+			<ReferencesSection {...noopProps} item={null} isMultiTurn={false} />,
+		);
+		// Search tab should be visible (RAG compat surface)
+		const searchBtns = screen.getAllByRole("button", { name: /Search/i });
+		expect(searchBtns.length).toBeGreaterThan(0);
+	});
+
+	it("shows empty state when multi-turn mode and no evidence or references", () => {
+		const item = makeItem(); // no toolCalls, no traceIds, etc.
+		render(<ReferencesSection {...noopProps} item={item} isMultiTurn />);
+		// No references, no evidence data → empty state
+		expect(
+			screen.getByText(/No evidence or references available/i),
+		).toBeInTheDocument();
+	});
+
+	it("shows TracePanel header when item has expectedTools", () => {
+		const item = makeItem({
+			expectedTools: {
+				required: [{ name: "search" }],
+			},
+			toolCalls: [],
+		});
+		render(<ReferencesSection {...noopProps} item={item} isMultiTurn />);
+		expect(screen.getByText(/Evidence & Review/i)).toBeInTheDocument();
+	});
+
+	it("shows generic evidence for context-only items", () => {
+		const item = makeItem({
+			contextEntries: [
+				{ key: "customer_tier", value: "enterprise" },
+				{ key: "request", value: { region: "us" } },
+			],
+		});
+		render(<ReferencesSection {...noopProps} item={item} isMultiTurn />);
+		expect(screen.getByText(/Evidence & Review/i)).toBeInTheDocument();
+		// Context Entries is inside the "More Details" collapsible
+		expect(screen.getByText(/More Details/i)).toBeInTheDocument();
+	});
+
+	it("shows generic plugin-owned details for plugin-only items", () => {
+		const item = makeItem({
+			plugins: {
+				"rag-compat": {
+					kind: "retrieval-review",
+					version: "1",
+					data: {
+						retrievalMode: "semantic",
+						latencyMs: 42,
+					},
+				},
+			},
+		});
+		render(<ReferencesSection {...noopProps} item={item} isMultiTurn />);
+		expect(screen.getByText(/Evidence & Review/i)).toBeInTheDocument();
+		// Plugin Details is inside the "More Details" collapsible
+		expect(screen.getByText(/More Details/i)).toBeInTheDocument();
+	});
+
+	it("shows only evidence panel when multi-turn item has references", () => {
+		const item = makeItem({
+			toolCalls: [{ id: "tc1", name: "search", callType: "tool" }],
+			plugins: {
+				"rag-compat": {
+					kind: "rag-compat",
+					version: "1.0",
+					data: {
+						retrievals: {
+							_unassociated: {
+								candidates: [{ url: "https://example.com" }],
+							},
+						},
+					},
+				},
+			},
+		});
+		render(
+			<ReferencesSection
+				{...noopProps}
+				item={item}
+				references={getItemReferences(item)}
+				isMultiTurn
+			/>,
+		);
+		expect(screen.getByText(/Evidence & Review/i)).toBeInTheDocument();
+		// RAG references panel is hidden in multi-turn mode
+		expect(screen.queryByText(/Selected/i)).not.toBeInTheDocument();
+	});
+});
diff --git a/frontend/tests/unit/components/app/pages/StatsPage.test.tsx b/frontend/tests/unit/components/app/pages/StatsPage.test.tsx
index 1cfacd9..4e0e134 100644
--- a/frontend/tests/unit/components/app/pages/StatsPage.test.tsx
+++ b/frontend/tests/unit/components/app/pages/StatsPage.test.tsx
@@ -1,15 +1,20 @@
 import { fireEvent, render, screen, waitFor } from "@testing-library/react";
 import StatsPage from "../../../../../src/components/app/pages/StatsPage";
 import type { StatsPayload } from "../../../../../src/components/app/StatsView";
+import * as demoConfig from "../../../../../src/config/demo";
 import type { GroundTruthItem } from "../../../../../src/models/groundTruth";
 import * as statsSvc from "../../../../../src/services/stats";
 
 vi.mock("../../../../../src/services/stats");
+vi.mock("../../../../../src/config/demo", () => ({
+	shouldUseDemoProvider: vi.fn(() => false),
+}));
 
 describe("StatsPage", () => {
 	const items: GroundTruthItem[] = [] as GroundTruthItem[];
 
 	it("renders happy path stats", async () => {
+		vi.mocked(demoConfig.shouldUseDemoProvider).mockReturnValue(false);
 		(
 			statsSvc.getGroundTruthStats as unknown as {
 				mockResolvedValue: (v: StatsPayload) => void;
@@ -37,6 +42,7 @@ describe("StatsPage", () => {
 	});
 
 	it("falls back to zero on error", async () => {
+		vi.mocked(demoConfig.shouldUseDemoProvider).mockReturnValue(false);
 		(
 			statsSvc.getGroundTruthStats as unknown as {
 				mockRejectedValue: (e: unknown) => void;
@@ -51,6 +57,7 @@ describe("StatsPage", () => {
 	});
 
 	it("uses mock service when demoMode", async () => {
+		vi.mocked(demoConfig.shouldUseDemoProvider).mockReturnValue(true);
 		(
 			statsSvc.mockGetGroundTruthStats as unknown as {
 				mockResolvedValue: (v: StatsPayload) => void;
@@ -68,6 +75,7 @@ describe("StatsPage", () => {
 	});
 
 	it("calls onBack when Back clicked", async () => {
+		vi.mocked(demoConfig.shouldUseDemoProvider).mockReturnValue(true);
 		(
 			statsSvc.mockGetGroundTruthStats as unknown as {
 				mockResolvedValue: (v: StatsPayload) => void;
@@ -84,4 +92,22 @@ describe("StatsPage", () => {
 		fireEvent.click(screen.getByRole("button", { name: /Back/i }));
 		expect(onBack).toHaveBeenCalled();
 	});
+
+	it("uses backend stats in API-backed demo mode", async () => {
+		vi.mocked(demoConfig.shouldUseDemoProvider).mockReturnValue(false);
+		(
+			statsSvc.getGroundTruthStats as unknown as {
+				mockResolvedValue: (v: StatsPayload) => void;
+			}
+		).mockResolvedValue({
+			total: { approved: 4, draft: 1, deleted: 0 },
+			perSprint: [],
+		});
+
+		render(<StatsPage demoMode items={items} onBack={vi.fn()} />);
+
+		await waitFor(() =>
+			expect(statsSvc.getGroundTruthStats).toHaveBeenCalled(),
+		);
+	});
 });
diff --git a/frontend/tests/unit/components/modals/InspectItemModal.test.tsx b/frontend/tests/unit/components/modals/InspectItemModal.test.tsx
new file mode 100644
index 0000000..f63f055
--- /dev/null
+++ b/frontend/tests/unit/components/modals/InspectItemModal.test.tsx
@@ -0,0 +1,112 @@
+import { render, screen, waitFor } from "@testing-library/react";
+import { beforeEach, describe, expect, it, vi } from "vitest";
+import InspectItemModal from "../../../../src/components/modals/InspectItemModal";
+import { useGroundTruthCache } from "../../../../src/hooks/useGroundTruthCache";
+import type { GroundTruthItem } from "../../../../src/models/groundTruth";
+
+const serviceMocks = vi.hoisted(() => ({
+	getGroundTruth: vi.fn(),
+}));
+
+vi.mock("../../../../src/services/groundTruths", () => ({
+	getGroundTruth: serviceMocks.getGroundTruth,
+}));
+
+vi.mock("../../../../src/hooks/useModalKeys", () => ({
+	default: vi.fn(),
+}));
+
+vi.mock("../../../../src/components/app/editor/MultiTurnEditor", () => ({
+	default: ({ current }: { current: GroundTruthItem }) => (
+		<div data-testid="multi-turn-editor">{current.id}</div>
+	),
+}));
+
+vi.mock("../../../../src/components/common/TagChip", () => ({
+	default: ({ tag, isComputed }: { tag: string; isComputed?: boolean }) => (
+		<span>{isComputed ? `computed:${tag}` : `manual:${tag}`}</span>
+	),
+}));
+
+function createItem(overrides: Partial<GroundTruthItem> = {}): GroundTruthItem {
+	return {
+		id: "item-123",
+		status: "draft",
+		providerId: "api",
+		datasetName: "list-dataset",
+		bucket: "list-bucket",
+		manualTags: ["stale-manual"],
+		computedTags: ["stale-computed"],
+		comment: "Stale list comment",
+		reviewedAt: "2026-03-01T12:00:00.000Z",
+		history: [
+			{ role: "user", content: "Original question", turnId: "turn-1" },
+			{ role: "agent", content: "Original answer", turnId: "turn-2" },
+		],
+		...overrides,
+	};
+}
+
+describe("InspectItemModal", () => {
+	beforeEach(() => {
+		serviceMocks.getGroundTruth.mockReset();
+		useGroundTruthCache().clear();
+	});
+
+	it("renders fetched metadata from the refreshed item", async () => {
+		const listItem = createItem();
+		const fetchedItem = createItem({
+			status: "approved",
+			datasetName: "fresh-dataset",
+			bucket: "fresh-bucket",
+			manualTags: ["fresh-manual"],
+			computedTags: ["fresh-computed"],
+			comment: "Fresh item comment",
+			reviewedAt: "2026-03-13T08:45:00.000Z",
+		});
+		serviceMocks.getGroundTruth.mockResolvedValue(fetchedItem);
+
+		render(
+			<InspectItemModal isOpen={true} item={listItem} onClose={vi.fn()} />,
+		);
+
+		await waitFor(() => {
+			expect(screen.getByText("approved")).toBeInTheDocument();
+		});
+
+		expect(screen.getByText("fresh-dataset")).toBeInTheDocument();
+		expect(screen.getByText("fresh-bucket")).toBeInTheDocument();
+		expect(screen.getByText("Fresh item comment")).toBeInTheDocument();
+		expect(screen.getByText("computed:fresh-computed")).toBeInTheDocument();
+		expect(screen.getByText("manual:fresh-manual")).toBeInTheDocument();
+
+		expect(screen.queryByText("Stale list comment")).not.toBeInTheDocument();
+		expect(
+			screen.queryByText("computed:stale-computed"),
+		).not.toBeInTheDocument();
+		expect(screen.queryByText("manual:stale-manual")).not.toBeInTheDocument();
+	});
+
+	it("falls back to the original item metadata when the full fetch fails", async () => {
+		const listItem = createItem();
+		serviceMocks.getGroundTruth.mockRejectedValue(new Error("Fetch failed"));
+
+		render(
+			<InspectItemModal isOpen={true} item={listItem} onClose={vi.fn()} />,
+		);
+
+		await waitFor(() => {
+			expect(screen.getByText("Fetch failed")).toBeInTheDocument();
+		});
+
+		expect(screen.getByText("draft")).toBeInTheDocument();
+		expect(screen.getByText("list-dataset")).toBeInTheDocument();
+		expect(screen.getByText("list-bucket")).toBeInTheDocument();
+		expect(screen.getByText("Stale list comment")).toBeInTheDocument();
+		expect(screen.getByText("computed:stale-computed")).toBeInTheDocument();
+		expect(screen.getByText("manual:stale-manual")).toBeInTheDocument();
+		expect(screen.getByTestId("multi-turn-editor")).toHaveTextContent(
+			"item-123",
+		);
+	});
+});
diff --git a/frontend/tests/unit/config/demo-config.test.ts b/frontend/tests/unit/config/demo-config.test.ts
new file mode 100644
index 0000000..0ec549b
--- /dev/null
+++ b/frontend/tests/unit/config/demo-config.test.ts
@@ -0,0 +1,26 @@
+import { afterEach, describe, expect, it, vi } from "vitest";
+
+describe("demo config parsing", () => {
+	afterEach(() => {
+		vi.resetModules();
+		vi.unstubAllEnvs();
+	});
+
+	it("treats truthy demo mode as API-backed by default", async () => {
+		vi.stubEnv("VITE_DEMO_MODE", "true");
+		const config = await import("../../../src/config/demo");
+
+		expect(config.default).toBe(true);
+		expect(config.getDemoDataSource()).toBe("api");
+		expect(config.shouldUseDemoProvider()).toBe(false);
+	});
+
+	it("uses the JSON provider only when explicitly requested", async () => {
+		vi.stubEnv("VITE_DEMO_MODE", "json");
+		const config = await import("../../../src/config/demo");
+
+		expect(config.default).toBe(true);
+		expect(config.getDemoDataSource()).toBe("json");
+		expect(config.shouldUseDemoProvider()).toBe(true);
+	});
+});
diff --git a/frontend/tests/unit/demo.assign-flow.test.ts b/frontend/tests/unit/demo.assign-flow.test.ts
new file mode 100644
index 0000000..bd3a344
--- /dev/null
+++ b/frontend/tests/unit/demo.assign-flow.test.ts
@@ -0,0 +1,32 @@
+import { describe, expect, it, vi } from "vitest";
+
+import { resolveExplorerAssignSelection } from "../../src/demo";
+
+describe("resolveExplorerAssignSelection", () => {
+	it("switches to curate and returns a success toast when selection succeeds", async () => {
+		const selectItem = vi.fn().mockResolvedValue(true);
+
+		await expect(
+			resolveExplorerAssignSelection("item-123", selectItem),
+		).resolves.toEqual({
+			switchToCurate: true,
+			toastKind: "success",
+			toastMessage: "Assigned item-123 for curation",
+		});
+		expect(selectItem).toHaveBeenCalledWith("item-123");
+	});
+
+	it("keeps the user in explorer and returns an info toast when selection is cancelled or fails", async () => {
+		const selectItem = vi.fn().mockResolvedValue(false);
+
+		await expect(
+			resolveExplorerAssignSelection("item-123", selectItem),
+		).resolves.toEqual({
+			switchToCurate: false,
+			toastKind: "info",
+			toastMessage:
+				"Assigned item-123, but opening it in curate was cancelled or failed.",
+		});
+		expect(selectItem).toHaveBeenCalledWith("item-123");
+	});
+});
diff --git a/frontend/tests/unit/demo.inspect-cache.test.ts b/frontend/tests/unit/demo.inspect-cache.test.ts
new file mode 100644
index 0000000..05c5ac6
--- /dev/null
+++ b/frontend/tests/unit/demo.inspect-cache.test.ts
@@ -0,0 +1,41 @@
+import { beforeEach, describe, expect, it, vi } from "vitest";
+
+const cacheMocks = vi.hoisted(() => ({
+	invalidateGroundTruthCache: vi.fn(),
+}));
+
+vi.mock("../../src/hooks/useGroundTruthCache", () => ({
+	invalidateGroundTruthCache: cacheMocks.invalidateGroundTruthCache,
+}));
+
+import { invalidateInspectCacheForExplorerItem } from "../../src/demo";
+
+describe("invalidateInspectCacheForExplorerItem", () => {
+	beforeEach(() => {
+		cacheMocks.invalidateGroundTruthCache.mockClear();
+	});
+
+	it("invalidates the inspect cache when explorer items have full identifiers", () => {
+		invalidateInspectCacheForExplorerItem({
+			id: "item-123",
+			datasetName: "dataset-a",
+			bucket: "bucket-a",
+		});
+
+		expect(cacheMocks.invalidateGroundTruthCache).toHaveBeenCalledWith(
+			"dataset-a",
+			"bucket-a",
+			"item-123",
+		);
+	});
+
+	it("skips invalidation when explorer item metadata is incomplete", () => {
+		invalidateInspectCacheForExplorerItem({
+			id: "item-123",
+			datasetName: undefined,
+			bucket: "bucket-a",
+		});
+
+		expect(cacheMocks.invalidateGroundTruthCache).not.toHaveBeenCalled();
+	});
+});
diff --git a/frontend/tests/unit/error-boundary.test.tsx b/frontend/tests/unit/error-boundary.test.tsx
index f668750..eb3f537 100644
--- a/frontend/tests/unit/error-boundary.test.tsx
+++ b/frontend/tests/unit/error-boundary.test.tsx
@@ -1,12 +1,17 @@
 import { render } from "@testing-library/react";
 import type React from "react";
-import { beforeEach, describe, expect, it, vi } from "vitest";
+import { afterEach, beforeEach, describe, expect, it, vi } from "vitest";
 import ErrorBoundary from "../../src/components/common/ErrorBoundary";
 import * as telemetry from "../../src/services/telemetry";
 
 describe("ErrorBoundary", () => {
 	beforeEach(() => {
 		vi.spyOn(telemetry, "logException").mockImplementation(() => {});
+		vi.spyOn(console, "error").mockImplementation(() => {});
+	});
+
+	afterEach(() => {
+		vi.restoreAllMocks();
 	});
 
 	it("renders fallback on error and logs exception", () => {
diff --git a/frontend/tests/unit/hooks/useGroundTruth-deleteTurn.test.tsx b/frontend/tests/unit/hooks/useGroundTruth-deleteTurn.test.tsx
index 222fc1f..50f7545 100644
--- a/frontend/tests/unit/hooks/useGroundTruth-deleteTurn.test.tsx
+++ b/frontend/tests/unit/hooks/useGroundTruth-deleteTurn.test.tsx
@@ -1,7 +1,7 @@
 import { act, renderHook, waitFor } from "@testing-library/react";
 import type { ConversationTurn } from "../../../src/models/groundTruth";
+import { getItemReferences } from "../../../src/models/groundTruth";
 
-// Force demo mode so the hook uses JsonProvider with DEMO_JSON
 vi.mock("../../../src/config/demo", () => ({
 	default: true,
 	DEMO_MODE: true,
@@ -11,385 +11,204 @@ vi.mock("../../../src/config/demo", () => ({
 
 let useGroundTruth: typeof import("../../../src/hooks/useGroundTruth").default;
 beforeAll(async () => {
-	// Ensure modules are re-evaluated with our mock in place
 	vi.resetModules();
 	({ default: useGroundTruth } = await import(
 		"../../../src/hooks/useGroundTruth"
 	));
 });
 
-describe("useGroundTruth deleteTurn", () => {
-	it("should delete a middle turn and re-index references correctly", async () => {
-		const { result } = renderHook(() => useGroundTruth());
-
-		// Wait for initial list load
-		await waitFor(() => {
-			expect(result.current.current).toBeTruthy();
-		});
+function requireCurrent<T>(current: T | null | undefined): T {
+	expect(current).toBeTruthy();
+	if (!current) {
+		throw new Error("Expected current item to be loaded");
+	}
+	return current;
+}
+
+async function loadHook() {
+	const hook = renderHook(() => useGroundTruth());
+	await waitFor(() => {
+		expect(hook.result.current.current).toBeTruthy();
+	});
+	return hook;
+}
+
+async function seedHistory(
+	result: { current: ReturnType<typeof useGroundTruth> },
+	history: ConversationTurn[],
+) {
+	await act(async () => {
+		result.current.updateHistory(history);
+	});
+	await act(async () => {
+		if (result.current.current) {
+			result.current.current.plugins = {};
+		}
+	});
+}
 
-		// Setup multi-turn conversation with references
+describe("useGroundTruth deleteTurn", () => {
+	it("preserves stable turn ownership when deleting a middle turn", async () => {
+		const { result } = await loadHook();
 		const history: ConversationTurn[] = [
-			{ role: "user", content: "First question" },
-			{ role: "agent", content: "First answer" },
-			{ role: "user", content: "Second question" },
-			{ role: "agent", content: "Second answer" },
-			{ role: "user", content: "Third question" },
-			{ role: "agent", content: "Third answer" },
+			{ role: "user", content: "First question", turnId: "turn-user-1" },
+			{ role: "agent", content: "First answer", turnId: "turn-agent-1" },
+			{ role: "user", content: "Second question", turnId: "turn-user-2" },
+			{ role: "agent", content: "Second answer", turnId: "turn-agent-2" },
+			{ role: "user", content: "Third question", turnId: "turn-user-3" },
+			{ role: "agent", content: "Third answer", turnId: "turn-agent-3" },
 		];
-
-		await act(async () => {
-			result.current.updateHistory(history);
-		});
-
-		// Clear existing references first, then add test references
-		await act(async () => {
-			if (result.current.current) {
-				result.current.current.references = [];
-			}
-		});
-
-		// Add references for different turns
+		await seedHistory(result, history);
 		await act(async () => {
 			result.current.addReferences([
-				{ id: "ref1", url: "http://example.com/1", messageIndex: 1 },
-				{ id: "ref2", url: "http://example.com/2", messageIndex: 3 },
-				{ id: "ref3", url: "http://example.com/3", messageIndex: 5 },
+				{ id: "ref1", url: "http://example.com/1", turnId: "turn-agent-1" },
+				{ id: "ref2", url: "http://example.com/2", turnId: "turn-agent-2" },
+				{ id: "ref3", url: "http://example.com/3", turnId: "turn-agent-3" },
 			]);
 		});
-
-		expect(result.current.current?.history?.length).toBe(6);
-		expect(result.current.current?.references?.length).toBe(3);
-
-		// Delete turn at index 2 (second user turn)
 		await act(async () => {
 			result.current.deleteTurn(2);
 		});
-
-		// Verify turn was deleted
+		const refs = getItemReferences(requireCurrent(result.current.current));
 		expect(result.current.current?.history?.length).toBe(5);
-		expect(result.current.current?.history?.[2].content).toBe("Second answer");
-
-		// Verify references were re-indexed
-		// When we delete turn 2, references are NOT deleted for turn 2,
-		// they are deleted if they HAVE messageIndex === 2
-		// In our case: ref1 has messageIndex 1, ref2 has messageIndex 3, ref3 has messageIndex 5
-		// None have messageIndex === 2, so all 3 should remain, just shifted down
-		const refs = result.current.current?.references || [];
-		expect(refs.length).toBe(3); // All refs remain, just re-indexed
-
-		const ref1 = refs.find((r) => r.id === "ref1");
-		const ref2 = refs.find((r) => r.id === "ref2");
-		const ref3 = refs.find((r) => r.id === "ref3");
-
-		expect(ref1?.messageIndex).toBe(1); // Unchanged (before deleted turn)
-		expect(ref2?.messageIndex).toBe(2); // Shifted down from 3 to 2
-		expect(ref3?.messageIndex).toBe(4); // Shifted down from 5 to 4
-	});
-
-	it("should delete the first turn", async () => {
-		const { result } = renderHook(() => useGroundTruth());
-
-		await waitFor(() => {
-			expect(result.current.current).toBeTruthy();
+		expect(result.current.current?.history?.[2].turnId).toBe("turn-agent-2");
+		expect(refs).toHaveLength(3);
+		expect(
+			refs.find((ref) => ref.url === "http://example.com/1"),
+		).toMatchObject({
+			turnId: "turn-agent-1",
+		});
+		expect(
+			refs.find((ref) => ref.url === "http://example.com/2"),
+		).toMatchObject({
+			turnId: "turn-agent-2",
+		});
+		expect(
+			refs.find((ref) => ref.url === "http://example.com/3"),
+		).toMatchObject({
+			turnId: "turn-agent-3",
 		});
+	});
 
+	it("removes references owned by the deleted turn while preserving later turn ids", async () => {
+		const { result } = await loadHook();
 		const history: ConversationTurn[] = [
-			{ role: "user", content: "First question" },
-			{ role: "agent", content: "First answer" },
-			{ role: "user", content: "Second question" },
+			{ role: "user", content: "First question", turnId: "turn-user-1" },
+			{ role: "agent", content: "First answer", turnId: "turn-agent-1" },
+			{ role: "user", content: "Second question", turnId: "turn-user-2" },
+			{ role: "agent", content: "Second answer", turnId: "turn-agent-2" },
 		];
-
-		await act(async () => {
-			result.current.updateHistory(history);
-		});
-
-		// Clear existing references first
-		await act(async () => {
-			if (result.current.current) {
-				result.current.current.references = [];
-			}
-		});
-
+		await seedHistory(result, history);
 		await act(async () => {
 			result.current.addReferences([
-				{ id: "ref1", url: "http://example.com/1", messageIndex: 1 },
-				{ id: "ref2", url: "http://example.com/2", messageIndex: 2 },
+				{
+					id: "ref1-turn1",
+					url: "http://example.com/1",
+					turnId: "turn-agent-1",
+				},
+				{
+					id: "ref2-turn1",
+					url: "http://example.com/2",
+					turnId: "turn-agent-1",
+				},
+				{
+					id: "ref1-turn2",
+					url: "http://example.com/3",
+					turnId: "turn-agent-2",
+				},
 			]);
 		});
-
-		// Delete first turn
 		await act(async () => {
-			result.current.deleteTurn(0);
+			result.current.deleteTurn(1);
 		});
-
-		expect(result.current.current?.history?.length).toBe(2);
-		expect(result.current.current?.history?.[0].content).toBe("First answer");
-
-		// All references should be shifted down
-		const refs = result.current.current?.references || [];
-		expect(refs.find((r) => r.id === "ref1")?.messageIndex).toBe(0);
-		expect(refs.find((r) => r.id === "ref2")?.messageIndex).toBe(1);
-	});
-
-	it("should delete the last turn", async () => {
-		const { result } = renderHook(() => useGroundTruth());
-
-		await waitFor(() => {
-			expect(result.current.current).toBeTruthy();
+		const refs = getItemReferences(requireCurrent(result.current.current));
+		expect(refs).toHaveLength(1);
+		expect(refs[0]).toMatchObject({
+			url: "http://example.com/3",
+			turnId: "turn-agent-2",
 		});
+		expect(result.current.current?.history?.[2].turnId).toBe("turn-agent-2");
+	});
 
+	it("recomputes fallback messageIndex only for references without stable turn ownership", async () => {
+		const { result } = await loadHook();
 		const history: ConversationTurn[] = [
-			{ role: "user", content: "First question" },
-			{ role: "agent", content: "First answer" },
-			{ role: "user", content: "Second question" },
+			{ role: "user", content: "First question", turnId: "turn-user-1" },
+			{ role: "agent", content: "First answer", turnId: "turn-agent-1" },
+			{ role: "user", content: "Second question", turnId: "turn-user-2" },
 		];
-
-		await act(async () => {
-			result.current.updateHistory(history);
-		});
-
-		// Clear existing references first
-		await act(async () => {
-			if (result.current.current) {
-				result.current.current.references = [];
-			}
-		});
-
+		await seedHistory(result, history);
 		await act(async () => {
 			result.current.addReferences([
-				{ id: "ref1", url: "http://example.com/1", messageIndex: 1 },
-				{ id: "ref2", url: "http://example.com/2", messageIndex: 2 },
+				{
+					id: "stable",
+					url: "http://example.com/stable",
+					turnId: "turn-agent-1",
+				},
+				{ id: "fallback", url: "http://example.com/fallback", messageIndex: 2 },
+				{ id: "global", url: "http://example.com/global" },
 			]);
 		});
-
-		// Delete last turn
 		await act(async () => {
-			result.current.deleteTurn(2);
+			result.current.deleteTurn(0);
 		});
-
-		expect(result.current.current?.history?.length).toBe(2);
-		expect(result.current.current?.history?.[1].content).toBe("First answer");
-
-		// Reference for deleted turn should be removed
-		const refs = result.current.current?.references || [];
-		expect(refs.length).toBe(1);
-		expect(refs.find((r) => r.id === "ref2")).toBeUndefined();
-		expect(refs.find((r) => r.id === "ref1")?.messageIndex).toBe(1);
-	});
-
-	it("should sync question/answer fields after deletion", async () => {
-		const { result } = renderHook(() => useGroundTruth());
-
-		await waitFor(() => {
-			expect(result.current.current).toBeTruthy();
+		const refs = getItemReferences(requireCurrent(result.current.current));
+		expect(
+			refs.find((ref) => ref.url === "http://example.com/stable"),
+		).toMatchObject({
+			turnId: "turn-agent-1",
 		});
+		expect(
+			refs.find((ref) => ref.url === "http://example.com/fallback"),
+		).toMatchObject({
+			messageIndex: 1,
+		});
+		expect(
+			refs.find((ref) => ref.url === "http://example.com/global"),
+		).toBeDefined();
+	});
 
+	it("syncs question and answer projections after deletion", async () => {
+		const { result } = await loadHook();
 		const history: ConversationTurn[] = [
-			{ role: "user", content: "First question" },
-			{ role: "agent", content: "First answer" },
-			{ role: "user", content: "Second question" },
-			{ role: "agent", content: "Second answer" },
+			{ role: "user", content: "First question", turnId: "turn-user-1" },
+			{ role: "agent", content: "First answer", turnId: "turn-agent-1" },
+			{ role: "user", content: "Second question", turnId: "turn-user-2" },
+			{
+				role: "orchestrator-agent",
+				content: "Second answer",
+				turnId: "turn-agent-2",
+			},
 		];
-
-		await act(async () => {
-			result.current.updateHistory(history);
-		});
-
+		await seedHistory(result, history);
 		expect(result.current.current?.question).toBe("Second question");
 		expect(result.current.current?.answer).toBe("Second answer");
-
-		// Delete last two turns
 		await act(async () => {
-			result.current.deleteTurn(3); // Delete last agent turn
+			result.current.deleteTurn(3);
 		});
-
+		expect(result.current.current?.question).toBe("Second question");
+		expect(result.current.current?.answer).toBe("First answer");
 		await act(async () => {
-			result.current.deleteTurn(2); // Delete last user turn
+			result.current.deleteTurn(2);
 		});
-
-		// Question and answer should sync to remaining turns
 		expect(result.current.current?.question).toBe("First question");
 		expect(result.current.current?.answer).toBe("First answer");
 	});
 
-	it("should handle deletion with no references", async () => {
-		const { result } = renderHook(() => useGroundTruth());
-
-		await waitFor(() => {
-			expect(result.current.current).toBeTruthy();
-		});
-
-		const history: ConversationTurn[] = [
-			{ role: "user", content: "First question" },
-			{ role: "agent", content: "First answer" },
-		];
-
-		await act(async () => {
-			result.current.updateHistory(history);
-		});
-
-		// Clear existing references
+	it("handles empty or out-of-range deletions without breaking canonical state", async () => {
+		const { result } = await loadHook();
+		await seedHistory(result, [
+			{ role: "user", content: "Only question", turnId: "turn-user-1" },
+		]);
 		await act(async () => {
-			if (result.current.current) {
-				result.current.current.references = [];
-			}
-		});
-
-		// Delete turn without any references
-		await act(async () => {
-			result.current.deleteTurn(0);
-		});
-
-		expect(result.current.current?.history?.length).toBe(1);
-		expect(result.current.current?.references?.length).toBe(0);
-	});
-
-	it("should delete all turns leaving empty history", async () => {
-		const { result } = renderHook(() => useGroundTruth());
-
-		await waitFor(() => {
-			expect(result.current.current).toBeTruthy();
-		});
-
-		const history: ConversationTurn[] = [
-			{ role: "user", content: "Only question" },
-		];
-
-		await act(async () => {
-			result.current.updateHistory(history);
+			result.current.deleteTurn(10);
 		});
-
+		expect(result.current.current?.history).toHaveLength(1);
 		await act(async () => {
 			result.current.deleteTurn(0);
 		});
-
-		expect(result.current.current?.history?.length).toBe(0);
+		expect(result.current.current?.history).toHaveLength(0);
 		expect(result.current.current?.question).toBe("");
 		expect(result.current.current?.answer).toBe("");
 	});
-
-	it("should handle out-of-range index gracefully", async () => {
-		const { result } = renderHook(() => useGroundTruth());
-
-		await waitFor(() => {
-			expect(result.current.current).toBeTruthy();
-		});
-
-		const history: ConversationTurn[] = [
-			{ role: "user", content: "First question" },
-			{ role: "agent", content: "First answer" },
-		];
-
-		await act(async () => {
-			result.current.updateHistory(history);
-		});
-
-		const beforeLength = result.current.current?.history?.length;
-
-		// Try to delete with invalid index
-		await act(async () => {
-			result.current.deleteTurn(10);
-		});
-
-		// History should remain unchanged
-		expect(result.current.current?.history?.length).toBe(beforeLength);
-	});
-
-	it("should remove references for deleted turn while preserving others", async () => {
-		const { result } = renderHook(() => useGroundTruth());
-
-		await waitFor(() => {
-			expect(result.current.current).toBeTruthy();
-		});
-
-		const history: ConversationTurn[] = [
-			{ role: "user", content: "First question" },
-			{ role: "agent", content: "First answer" },
-			{ role: "user", content: "Second question" },
-			{ role: "agent", content: "Second answer" },
-		];
-
-		await act(async () => {
-			result.current.updateHistory(history);
-		});
-
-		// Clear existing references first
-		await act(async () => {
-			if (result.current.current) {
-				result.current.current.references = [];
-			}
-		});
-
-		// Add multiple references for the same turn
-		await act(async () => {
-			result.current.addReferences([
-				{ id: "ref1-turn1", url: "http://example.com/1", messageIndex: 1 },
-				{ id: "ref2-turn1", url: "http://example.com/2", messageIndex: 1 },
-				{ id: "ref3-turn1", url: "http://example.com/3", messageIndex: 1 },
-				{ id: "ref1-turn3", url: "http://example.com/4", messageIndex: 3 },
-			]);
-		});
-
-		expect(result.current.current?.references?.length).toBe(4);
-
-		// Delete turn at index 1 (first agent turn)
-		await act(async () => {
-			result.current.deleteTurn(1);
-		});
-
-		// All references for turn 1 should be removed
-		const refs = result.current.current?.references || [];
-		expect(refs.length).toBe(1);
-		expect(refs.find((r) => r.id.includes("turn1"))).toBeUndefined();
-
-		// Reference for turn 3 should remain but re-indexed to turn 2
-		const ref = refs.find((r) => r.id === "ref1-turn3");
-		expect(ref).toBeDefined();
-		expect(ref?.messageIndex).toBe(2); // Shifted from 3 to 2
-	});
-
-	it("should preserve references without messageIndex", async () => {
-		const { result } = renderHook(() => useGroundTruth());
-
-		await waitFor(() => {
-			expect(result.current.current).toBeTruthy();
-		});
-
-		const history: ConversationTurn[] = [
-			{ role: "user", content: "First question" },
-			{ role: "agent", content: "First answer" },
-		];
-
-		await act(async () => {
-			result.current.updateHistory(history);
-		});
-
-		// Clear existing references first
-		await act(async () => {
-			if (result.current.current) {
-				result.current.current.references = [];
-			}
-		});
-
-		// Add references with and without messageIndex
-		await act(async () => {
-			result.current.addReferences([
-				{ id: "ref-with-index", url: "http://example.com/1", messageIndex: 1 },
-				{ id: "ref-without-index", url: "http://example.com/2" }, // No messageIndex
-			]);
-		});
-
-		expect(result.current.current?.references?.length).toBe(2);
-
-		// Delete turn
-		await act(async () => {
-			result.current.deleteTurn(1);
-		});
-
-		// Reference without messageIndex should be preserved
-		const refs = result.current.current?.references || [];
-		expect(refs.length).toBe(1);
-		expect(refs.find((r) => r.id === "ref-without-index")).toBeDefined();
-		expect(refs.find((r) => r.id === "ref-with-index")).toBeUndefined();
-	});
 });
diff --git a/frontend/tests/unit/hooks/useGroundTruth-multiturn.test.tsx b/frontend/tests/unit/hooks/useGroundTruth-multiturn.test.tsx
index 0539e3c..41b00bd 100644
--- a/frontend/tests/unit/hooks/useGroundTruth-multiturn.test.tsx
+++ b/frontend/tests/unit/hooks/useGroundTruth-multiturn.test.tsx
@@ -1,9 +1,6 @@
 import { act, renderHook, waitFor } from "@testing-library/react";
-import type {
-	ConversationTurn,
-	GroundTruthItem,
-	Reference,
-} from "../../../src/models/groundTruth";
+import type { ConversationTurn } from "../../../src/models/groundTruth";
+import { getItemReferences } from "../../../src/models/groundTruth";
 
 vi.mock("../../../src/config/demo", () => ({
 	default: true,
@@ -12,18 +9,6 @@ vi.mock("../../../src/config/demo", () => ({
 	isDemoModeIgnored: () => false,
 }));
 
-const callAgentChatMock = vi.fn();
-
-vi.mock("../../../src/services/chatService", async () => {
-	const actual = await vi.importActual<
-		typeof import("../../../src/services/chatService")
-	>("../../../src/services/chatService");
-	return {
-		...actual,
-		callAgentChat: callAgentChatMock,
-	};
-});
-
 vi.mock("../../../src/services/telemetry", () => ({
 	logEvent: vi.fn(),
 	logException: vi.fn(),
@@ -39,10 +24,6 @@ beforeAll(async () => {
 	));
 });
 
-afterEach(() => {
-	callAgentChatMock.mockReset();
-});
-
 async function setupHook() {
 	const utils = renderHook(() => useGroundTruth());
 	await waitFor(() => {
@@ -61,7 +42,18 @@ describe("useGroundTruth multi-turn flows", () => {
 		await act(async () => {
 			result.current.updateHistory(history);
 		});
-		expect(result.current.current?.history).toEqual(history);
+		expect(result.current.current?.history).toHaveLength(2);
+		expect(result.current.current?.history?.[0]).toMatchObject({
+			role: "user",
+			content: "New question",
+		});
+		expect(result.current.current?.history?.[1]).toMatchObject({
+			role: "agent",
+			content: "Fresh answer",
+		});
+		expect(
+			result.current.current?.history?.every((turn) => !!turn.turnId),
+		).toBe(true);
 		expect(result.current.current?.question).toBe("New question");
 		expect(result.current.current?.answer).toBe("Fresh answer");
 	});
@@ -86,144 +78,113 @@ describe("useGroundTruth multi-turn flows", () => {
 		expect(result.current.current?.answer).toBe("Agent reply");
 	});
 
-	it("generateAgentTurn appends agent response with references", async () => {
+	it("stateSignature ignores visitedAt mutations for hasUnsaved", async () => {
 		const { result } = await setupHook();
+		const before = result.current.hasUnsaved;
+		const current = result.current.current;
+		expect(current).toBeTruthy();
+		if (!current) {
+			throw new Error("Expected current item");
+		}
+		const firstRef = getItemReferences(current)[0];
+		expect(firstRef).toBeTruthy();
 		await act(async () => {
-			result.current.addTurn("user", "Need help with a CAD application");
-		});
-		callAgentChatMock.mockResolvedValue({
-			content: " Generated agent guidance ",
-			references: [
-				{
-					id: "chat-ref-1",
-					title: "Doc",
-					url: "https://docs.example.com/agent",
-					snippet: "Snippet",
-					keyParagraph: "Key",
-				},
-			],
-		});
-		await act(async () => {
-			const res = await result.current.generateAgentTurn(-1);
-			expect(res.ok).toBe(true);
-			expect(res).toMatchObject({ messageIndex: 1 });
-		});
-		expect(callAgentChatMock).toHaveBeenCalledTimes(1);
-		const current = result.current.current as GroundTruthItem;
-		expect(current.history?.length).toBe(2);
-		expect(current.history?.[1]).toMatchObject({
-			role: "agent",
-			content: "Generated agent guidance",
-		});
-		const refsForTurn = (current.references || []).filter(
-			(r) => r.messageIndex === 1,
-		);
-		expect(refsForTurn).toHaveLength(1);
-		expect(refsForTurn[0]).toMatchObject({
-			url: "https://docs.example.com/agent",
-			snippet: "Snippet",
-			keyParagraph: "Key",
+			if (firstRef) {
+				result.current.openReference(firstRef);
+			}
 		});
+		expect(result.current.hasUnsaved).toBe(before);
 	});
 
-	it("generateAgentTurn fails when no prior user turn exists", async () => {
+	it("stateSignature changes when reference messageIndex updates", async () => {
 		const { result } = await setupHook();
-		callAgentChatMock.mockResolvedValue({ content: "x", references: [] });
+		const current = result.current.current;
+		expect(current).toBeTruthy();
+		if (!current) {
+			throw new Error("Expected current item");
+		}
+		const ref = getItemReferences(current)[0];
+		expect(ref).toBeTruthy();
 		await act(async () => {
-			const res = await result.current.generateAgentTurn(-1);
-			expect(res.ok).toBe(false);
-			if (!res.ok) {
-				expect(res.error).toMatch(/user turn/i);
+			if (ref) {
+				result.current.updateReference(ref.id, { messageIndex: 0 });
 			}
 		});
-		expect(callAgentChatMock).not.toHaveBeenCalled();
+		expect(result.current.hasUnsaved).toBe(true);
 	});
 
-	it("regenerateAgentTurn updates only targeted agent turn and references", async () => {
+	it("marks contextEntries-only edits as unsaved and clears them after save", async () => {
 		const { result } = await setupHook();
-		const seedHistory: ConversationTurn[] = [
-			{ role: "user", content: "Original Q" },
-			{ role: "agent", content: "Outdated A" },
-			{ role: "user", content: "Second Q" },
-		];
+		expect(result.current.hasUnsaved).toBe(false);
+
 		await act(async () => {
-			result.current.updateHistory(seedHistory);
-			result.current.addReferences([
-				{
-					id: "turn-ref",
-					title: "Old",
-					url: "https://ref.example.com/old",
-					snippet: "Old snippet",
-					messageIndex: 1,
-				},
-			] as Reference[]);
-		});
-		callAgentChatMock.mockResolvedValue({
-			content: "Updated agent answer",
-			references: [
+			result.current.updateContextEntries([
 				{
-					id: "chat-ref-2",
-					url: "https://ref.example.com/new",
-					snippet: "New snippet",
-					keyParagraph: "New key",
+					key: "test-context-entry",
+					value: { source: "useGroundTruth-multiturn.test.tsx" },
 				},
-			],
+			]);
 		});
+
+		expect(result.current.hasUnsaved).toBe(true);
+
 		await act(async () => {
-			const res = await result.current.regenerateAgentTurn(1);
-			expect(res.ok).toBe(true);
-		});
-		const history = result.current.current?.history ?? [];
-		expect(history[1]).toMatchObject({ content: "Updated agent answer" });
-		expect(history[0]).toMatchObject({ content: "Original Q" });
-		expect(history[2]).toMatchObject({ content: "Second Q" });
-		const refs = result.current.current?.references ?? [];
-		const refsForTurn = refs.filter((r) => r.messageIndex === 1);
-		expect(refsForTurn).toHaveLength(1);
-		expect(refsForTurn[0].url).toBe("https://ref.example.com/new");
+			const saveResult = await result.current.save();
+			expect(saveResult.ok).toBe(true);
+		});
+
+		expect(result.current.hasUnsaved).toBe(false);
+		expect(result.current.current?.contextEntries).toEqual([
+			{
+				key: "test-context-entry",
+				value: { source: "useGroundTruth-multiturn.test.tsx" },
+			},
+		]);
 	});
 
-	it("regenerateAgentTurn rejects non-agent turns", async () => {
+	it("keeps cleared contextEntries cleared after save", async () => {
 		const { result } = await setupHook();
-		const seedHistory: ConversationTurn[] = [
-			{ role: "user", content: "Q" },
-			{ role: "user", content: "Another" },
-		];
+		expect(result.current.current?.contextEntries?.length).toBeGreaterThan(0);
+
 		await act(async () => {
-			result.current.updateHistory(seedHistory);
+			result.current.updateContextEntries([]);
 		});
+
+		expect(result.current.hasUnsaved).toBe(true);
+		expect(result.current.current?.contextEntries).toEqual([]);
+
 		await act(async () => {
-			const res = await result.current.regenerateAgentTurn(1);
-			expect(res.ok).toBe(false);
-			if (!res.ok) {
-				expect(res.error).toMatch(/only agent turns/i);
-			}
+			const saveResult = await result.current.save();
+			expect(saveResult.ok).toBe(true);
 		});
+
+		expect(result.current.hasUnsaved).toBe(false);
+		expect(result.current.current?.contextEntries).toEqual([]);
 	});
 
-	it("stateSignature ignores visitedAt mutations for hasUnsaved", async () => {
+	it("marks expectedTools-only edits as unsaved and clears them after save", async () => {
 		const { result } = await setupHook();
-		const before = result.current.hasUnsaved;
-		const firstRef = result.current.current?.references?.[0];
-		expect(firstRef).toBeTruthy();
+		expect(result.current.hasUnsaved).toBe(false);
+
 		await act(async () => {
-			if (firstRef) {
-				result.current.openReference(firstRef);
-			}
+			result.current.updateExpectedTools({
+				required: [{ name: "test_required_tool" }],
+				optional: [{ name: "test_optional_tool" }],
+			});
 		});
-		expect(result.current.hasUnsaved).toBe(before);
-	});
 
-	it("stateSignature changes when reference messageIndex updates", async () => {
-		const { result } = await setupHook();
-		const ref = result.current.current?.references?.[0];
-		expect(ref).toBeTruthy();
+		expect(result.current.hasUnsaved).toBe(true);
+
 		await act(async () => {
-			if (ref) {
-				result.current.updateReference(ref.id, { messageIndex: 0 });
-			}
+			const saveResult = await result.current.save();
+			expect(saveResult.ok).toBe(true);
+		});
+
+		expect(result.current.hasUnsaved).toBe(false);
+		expect(result.current.current?.expectedTools).toEqual({
+			required: [{ name: "test_required_tool" }],
+			optional: [{ name: "test_optional_tool" }],
 		});
-		expect(result.current.hasUnsaved).toBe(true);
 	});
 
 	it("marks unsaved when history changes", async () => {
diff --git a/frontend/tests/unit/hooks/useGroundTruth-visitedAt-save.test.tsx b/frontend/tests/unit/hooks/useGroundTruth-visitedAt-save.test.tsx
index 0bbd7a8..c51526c 100644
--- a/frontend/tests/unit/hooks/useGroundTruth-visitedAt-save.test.tsx
+++ b/frontend/tests/unit/hooks/useGroundTruth-visitedAt-save.test.tsx
@@ -1,4 +1,5 @@
 import { act, renderHook, waitFor } from "@testing-library/react";
+import { getItemReferences } from "../../../src/models/groundTruth";
 
 // Force API mode (not demo) so we exercise ApiProvider code path
 vi.mock("../../../src/config/demo", () => ({
@@ -34,21 +35,124 @@ const initialApiItem = {
 	synthQuestion: null,
 	answer: "original answer",
 	comment: null,
-	refs: [
+	refs: [],
+	history: [
 		{
-			url: "https://example.com/doc1",
-			content: "Snippet one",
-			keyExcerpt: null,
-			bonus: false,
+			role: "user",
+			msg: "What is test question?",
+			turnId: "turn-user-1",
+			stepId: "step-user-1",
+		},
+		{
+			role: "assistant",
+			msg: "original answer",
+			turnId: "turn-agent-1",
+			stepId: "step-agent-1",
+		},
+		{
+			role: "user",
+			msg: "What about the duplicate URL case?",
+			turnId: "turn-user-2",
+			stepId: "step-user-2",
+		},
+		{
+			role: "assistant",
+			msg: "follow-up answer",
+			turnId: "turn-agent-2",
+			stepId: "step-agent-2",
+		},
+	],
+	toolCalls: [
+		{
+			id: "tool-call-search",
+			name: "search",
+			callType: "tool",
+		},
+		{
+			id: "tool-call-browser",
+			name: "browser.open",
+			callType: "tool",
 		},
 	],
+	plugins: {
+		"rag-compat": {
+			kind: "rag-compat",
+			version: "1.0",
+			data: {
+				turnIdentity: [
+					{ turnId: "turn-user-1", stepId: "step-user-1" },
+					{ turnId: "turn-agent-1", stepId: "step-agent-1" },
+					{ turnId: "turn-user-2", stepId: "step-user-2" },
+					{ turnId: "turn-agent-2", stepId: "step-agent-2" },
+				],
+				retrievals: {
+					"tool-call-search": {
+						candidates: [
+							{
+								url: "https://example.com/doc1",
+								title: "Doc one",
+								chunk: "Snippet one",
+								toolCallId: "tool-call-search",
+								turnId: "turn-agent-1",
+								bonus: false,
+							},
+						],
+					},
+					"tool-call-browser": {
+						candidates: [
+							{
+								url: "https://example.com/doc1",
+								title: "Doc one follow-up",
+								chunk: "Snippet duplicate",
+								toolCallId: "tool-call-browser",
+								turnId: "turn-agent-2",
+								bonus: false,
+							},
+						],
+					},
+				},
+			},
+		},
+	},
 	status: "draft",
 	tags: ["t1"],
+	contextEntries: undefined,
+	expectedTools: undefined,
 	datasetName: "ds",
 	bucket: "0",
 	_etag: "etag1",
 };
 
+function stripVisitedAtFromPlugins(plugins: unknown): unknown {
+	if (!plugins || typeof plugins !== "object") {
+		return plugins;
+	}
+	const nextPlugins = structuredClone(plugins) as Record<string, unknown>;
+	const ragCompat = nextPlugins["rag-compat"];
+	if (!ragCompat || typeof ragCompat !== "object") {
+		return nextPlugins;
+	}
+	const ragCompatRecord = ragCompat as {
+		data?: { retrievals?: Record<string, { candidates?: unknown[] }> };
+	};
+	const retrievals = ragCompatRecord.data?.retrievals;
+	if (!retrievals) {
+		return nextPlugins;
+	}
+	for (const bucket of Object.values(retrievals)) {
+		if (!Array.isArray(bucket?.candidates)) continue;
+		bucket.candidates = bucket.candidates.map((candidate) => {
+			if (!candidate || typeof candidate !== "object") return candidate;
+			const { visitedAt: _visitedAt, ...rest } = candidate as Record<
+				string,
+				unknown
+			>;
+			return rest;
+		});
+	}
+	return nextPlugins;
+}
+
 const getMyAssignmentsMock = vi.fn().mockResolvedValue([initialApiItem]);
 const updateAssignedGroundTruthMock = vi
 	.fn()
@@ -99,6 +203,29 @@ const updateAssignedGroundTruthMock = vi
 				id,
 				editedQuestion: patch.editedQuestion || initialApiItem.editedQuestion,
 				answer: patch.answer || initialApiItem.answer,
+				history:
+					Array.isArray(patch.history) && patch.history.length > 0
+						? patch.history
+						: initialApiItem.history,
+				toolCalls:
+					Array.isArray(patch.toolCalls) && patch.toolCalls.length > 0
+						? patch.toolCalls
+						: initialApiItem.toolCalls,
+				contextEntries: Array.isArray(patch.contextEntries)
+					? patch.contextEntries
+					: initialApiItem.contextEntries,
+				expectedTools:
+					patch.expectedTools && typeof patch.expectedTools === "object"
+						? patch.expectedTools
+						: initialApiItem.expectedTools,
+				plugins:
+					patch.plugins !== undefined
+						? stripVisitedAtFromPlugins(patch.plugins)
+						: initialApiItem.plugins,
+				status:
+					(typeof patch.status === "string"
+						? patch.status
+						: initialApiItem.status) || "draft",
 				refs: normalizedRefs,
 				_etag: "etag2",
 			};
@@ -114,6 +241,7 @@ vi.mock("../../../src/services/assignments", () => ({
 vi.mock("../../../src/services/groundTruths", () => ({
 	deleteGroundTruth: vi.fn(),
 	getGroundTruth: vi.fn(),
+	getGroundTruthRaw: vi.fn(),
 }));
 
 let useGroundTruth: typeof import("../../../src/hooks/useGroundTruth").default;
@@ -125,6 +253,10 @@ beforeAll(async () => {
 });
 
 describe("useGroundTruth visitedAt persistence on save (SA-232)", () => {
+	beforeEach(() => {
+		updateAssignedGroundTruthMock.mockClear();
+	});
+
 	it("preserves visitedAt after saving draft with answer change", async () => {
 		const { result } = renderHook(() => useGroundTruth());
 		await waitFor(() => {
@@ -132,18 +264,27 @@ describe("useGroundTruth visitedAt persistence on save (SA-232)", () => {
 		});
 		const current = result.current.current;
 		expect(current).toBeTruthy();
-		if (!current?.references?.[0]) {
+		if (!current) {
+			throw new Error("Expected current item");
+		}
+		const currentRefs = getItemReferences(current);
+		if (!currentRefs[0]) {
 			throw new Error("Expected at least one reference");
 		}
-		const ref = current.references[0];
+		const ref = currentRefs[0];
 		expect(ref.visitedAt).toBeFalsy();
 
 		// Mark visited via openReference
 		await act(async () => {
 			result.current.openReference(ref);
 		});
+		const afterOpenCurrent = result.current.current;
+		expect(afterOpenCurrent).toBeTruthy();
+		if (!afterOpenCurrent) {
+			throw new Error("Expected current item after opening reference");
+		}
 		const afterOpenVisitedAt =
-			result.current.current?.references?.[0]?.visitedAt;
+			getItemReferences(afterOpenCurrent)[0]?.visitedAt;
 		expect(afterOpenVisitedAt).toBeTruthy();
 
 		// Change answer so save is not a no-op
@@ -158,10 +299,167 @@ describe("useGroundTruth visitedAt persistence on save (SA-232)", () => {
 		});
 
 		// visitedAt should still be present
+		const afterSaveCurrent = result.current.current;
+		expect(afterSaveCurrent).toBeTruthy();
+		if (!afterSaveCurrent) {
+			throw new Error("Expected current item after save");
+		}
 		const afterSaveVisitedAt =
-			result.current.current?.references?.[0]?.visitedAt;
+			getItemReferences(afterSaveCurrent)[0]?.visitedAt;
 		expect(afterSaveVisitedAt).toBeTruthy();
 		// It should be exactly the same timestamp (we merge, not overwrite)
 		expect(afterSaveVisitedAt).toBe(afterOpenVisitedAt);
 	});
+
+	it("preserves distinct visitedAt values for duplicate URLs with different turn and tool ownership", async () => {
+		const { result } = renderHook(() => useGroundTruth());
+		await waitFor(() => {
+			expect(result.current.current).toBeTruthy();
+		});
+
+		const current = result.current.current;
+		expect(current).toBeTruthy();
+		if (!current) {
+			throw new Error("Expected current item");
+		}
+
+		const duplicateUrlRefs = getItemReferences(current).filter(
+			(ref) => ref.url === "https://example.com/doc1",
+		);
+		expect(duplicateUrlRefs).toHaveLength(2);
+
+		await act(async () => {
+			result.current.updateReference(duplicateUrlRefs[0].id, {
+				visitedAt: "2026-03-13T10:00:00.000Z",
+			});
+			result.current.updateReference(duplicateUrlRefs[1].id, {
+				visitedAt: "2026-03-13T10:05:00.000Z",
+			});
+			result.current.updateAnswer("updated answer for duplicate URL merge");
+		});
+
+		await act(async () => {
+			const saveResult = await result.current.save();
+			expect(saveResult.ok).toBe(true);
+		});
+
+		const afterSaveCurrent = result.current.current;
+		expect(afterSaveCurrent).toBeTruthy();
+		if (!afterSaveCurrent) {
+			throw new Error("Expected current item after save");
+		}
+
+		const afterSaveRefs = getItemReferences(afterSaveCurrent).filter(
+			(ref) => ref.url === "https://example.com/doc1",
+		);
+		expect(afterSaveRefs).toHaveLength(2);
+		expect(
+			afterSaveRefs.find((ref) => ref.toolCallId === "tool-call-search")
+				?.visitedAt,
+		).toBe("2026-03-13T10:00:00.000Z");
+		expect(
+			afterSaveRefs.find((ref) => ref.toolCallId === "tool-call-browser")
+				?.visitedAt,
+		).toBe("2026-03-13T10:05:00.000Z");
+	});
+
+	it("keeps same-owner duplicate URL chunks distinct and restores each visitedAt after save", async () => {
+		const { result } = renderHook(() => useGroundTruth());
+		await waitFor(() => {
+			expect(result.current.current).toBeTruthy();
+		});
+
+		const current = result.current.current;
+		expect(current).toBeTruthy();
+		if (!current) {
+			throw new Error("Expected current item");
+		}
+
+		const baseRef = getItemReferences(current).find(
+			(ref) =>
+				ref.url === "https://example.com/doc1" &&
+				ref.toolCallId === "tool-call-search" &&
+				ref.turnId === "turn-agent-1",
+		);
+		expect(baseRef).toBeTruthy();
+		if (!baseRef) {
+			throw new Error("Expected baseline reference");
+		}
+
+		await act(async () => {
+			result.current.addReferences([
+				{
+					...baseRef,
+					id: "ref-same-owner-second-chunk",
+					snippet: "Snippet one - second chunk",
+					keyParagraph: "Second chunk paragraph",
+					visitedAt: null,
+				},
+			]);
+		});
+
+		const afterAddCurrent = result.current.current;
+		expect(afterAddCurrent).toBeTruthy();
+		if (!afterAddCurrent) {
+			throw new Error("Expected current item after adding duplicate chunk");
+		}
+
+		const sameOwnerRefs = getItemReferences(afterAddCurrent).filter(
+			(ref) =>
+				ref.url === "https://example.com/doc1" &&
+				ref.toolCallId === "tool-call-search" &&
+				ref.turnId === "turn-agent-1",
+		);
+		expect(sameOwnerRefs).toHaveLength(2);
+
+		const firstChunk = sameOwnerRefs.find(
+			(ref) => ref.snippet === "Snippet one",
+		);
+		const secondChunk = sameOwnerRefs.find(
+			(ref) => ref.snippet === "Snippet one - second chunk",
+		);
+		expect(firstChunk).toBeTruthy();
+		expect(secondChunk).toBeTruthy();
+		if (!firstChunk || !secondChunk) {
+			throw new Error("Expected both same-owner chunks");
+		}
+
+		await act(async () => {
+			result.current.updateReference(firstChunk.id, {
+				visitedAt: "2026-03-13T11:00:00.000Z",
+			});
+			result.current.updateReference(secondChunk.id, {
+				visitedAt: "2026-03-13T11:05:00.000Z",
+			});
+			result.current.updateAnswer(
+				"updated answer for same-owner duplicate chunk",
+			);
+		});
+
+		await act(async () => {
+			const saveResult = await result.current.save();
+			expect(saveResult.ok).toBe(true);
+		});
+
+		const afterSaveCurrent = result.current.current;
+		expect(afterSaveCurrent).toBeTruthy();
+		if (!afterSaveCurrent) {
+			throw new Error("Expected current item after save");
+		}
+
+		const afterSaveRefs = getItemReferences(afterSaveCurrent).filter(
+			(ref) =>
+				ref.url === "https://example.com/doc1" &&
+				ref.toolCallId === "tool-call-search" &&
+				ref.turnId === "turn-agent-1",
+		);
+		expect(afterSaveRefs).toHaveLength(2);
+		expect(
+			afterSaveRefs.find((ref) => ref.snippet === "Snippet one")?.visitedAt,
+		).toBe("2026-03-13T11:00:00.000Z");
+		expect(
+			afterSaveRefs.find((ref) => ref.snippet === "Snippet one - second chunk")
+				?.visitedAt,
+		).toBe("2026-03-13T11:05:00.000Z");
+	});
 });
diff --git a/frontend/tests/unit/hooks/useGroundTruthCache.test.ts b/frontend/tests/unit/hooks/useGroundTruthCache.test.ts
index c46a6a6..b40e770 100644
--- a/frontend/tests/unit/hooks/useGroundTruthCache.test.ts
+++ b/frontend/tests/unit/hooks/useGroundTruthCache.test.ts
@@ -11,9 +11,10 @@ describe("useGroundTruthCache", () => {
 		datasetName: "test-dataset",
 		bucket: "bucket-uuid",
 		status: "draft",
-		question: "Test question?",
-		answer: "Test answer",
-		references: [],
+		history: [
+			{ role: "user", content: "Test question?", turnId: "turn_1" },
+			{ role: "agent", content: "Test answer", turnId: "turn_2" },
+		],
 		manualTags: [],
 		computedTags: [],
 		providerId: "api",
diff --git a/frontend/tests/unit/models/groundTruth.multiturn.test.ts b/frontend/tests/unit/models/groundTruth.multiturn.test.ts
index 2f6b763..15546f0 100644
--- a/frontend/tests/unit/models/groundTruth.multiturn.test.ts
+++ b/frontend/tests/unit/models/groundTruth.multiturn.test.ts
@@ -17,7 +17,6 @@ describe("groundTruth multi-turn helpers", () => {
 		providerId: "demo",
 		question: "fallback question",
 		answer: "fallback answer",
-		references: [],
 		status: "draft",
 		...overrides,
 	});
@@ -49,14 +48,14 @@ describe("groundTruth multi-turn helpers", () => {
 	});
 
 	describe("last turn helpers", () => {
-		it("falls back to question when no user turns exist", () => {
+		it("returns empty compatibility question when canonical history has no user turn", () => {
 			const item = makeItem({ history: [{ role: "agent", content: "Agent" }] });
-			expect(getLastUserTurn(item)).toBe("fallback question");
+			expect(getLastUserTurn(item)).toBe("");
 		});
 
-		it("falls back to answer when no agent turns exist", () => {
+		it("returns empty compatibility answer when canonical history has no agent turn", () => {
 			const item = makeItem({ history: [{ role: "user", content: "User" }] });
-			expect(getLastAgentTurn(item)).toBe("fallback answer");
+			expect(getLastAgentTurn(item)).toBe("");
 		});
 
 		it("returns latest matching turn content", () => {
diff --git a/frontend/tests/unit/models/gtHelpers.expectedBehavior.test.ts b/frontend/tests/unit/models/gtHelpers.expectedBehavior.test.ts
index 3a22ab3..12bb707 100644
--- a/frontend/tests/unit/models/gtHelpers.expectedBehavior.test.ts
+++ b/frontend/tests/unit/models/gtHelpers.expectedBehavior.test.ts
@@ -1,10 +1,22 @@
 /**
  * Unit tests for expected behavior validation in gtHelpers
+ *
+ * NOTE (Phase 2 generic schema): canApproveMultiTurn no longer requires
+ * expectedBehavior on every agent turn. Approval is instead gated on:
+ *   - valid conversation pattern (user/non-user alternating, ends on non-user)
+ *   - item not deleted
+ * plus the expectedTools gate and the opt-in reference requirements tested below.
  */
 
 import { describe, expect, it } from "vitest";
-import type { GroundTruthItem } from "../../../src/models/groundTruth";
-import { canApproveMultiTurn } from "../../../src/models/gtHelpers";
+import type {
+	ConversationTurn,
+	GroundTruthItem,
+} from "../../../src/models/groundTruth";
+import {
+	canApproveCandidate,
+	canApproveMultiTurn,
+} from "../../../src/models/gtHelpers";
 
 describe("canApproveMultiTurn - Expected Behavior Validation", () => {
 	const baseItem: GroundTruthItem = {
@@ -13,7 +25,8 @@ describe("canApproveMultiTurn - Expected Behavior Validation", () => {
 		question: "Test question",
 		answer: "Test answer",
 		status: "draft",
-		references: [],
+		expectedTools: { required: [{ name: "search" }] },
+		toolCalls: [{ id: "tc1", name: "search", callType: "tool" }],
 		history: [
 			{
 				role: "user",
@@ -32,7 +45,7 @@ describe("canApproveMultiTurn - Expected Behavior Validation", () => {
 		expect(result).toBe(true);
 	});
 
-	it("should block approval when agent turn has no expected behavior", () => {
+	it("should allow approval when agent turn has no expected behavior (generic schema: not required)", () => {
 		const itemWithoutBehavior: GroundTruthItem = {
 			...baseItem,
 			history: [
@@ -43,16 +56,16 @@ describe("canApproveMultiTurn - Expected Behavior Validation", () => {
 				{
 					role: "agent",
 					content: "Here is how you extrude a shape...",
-					// Missing expectedBehavior
+					// Missing expectedBehavior — no longer blocks approval
 				},
 			],
 		};
 
 		const result = canApproveMultiTurn(itemWithoutBehavior);
-		expect(result).toBe(false);
+		expect(result).toBe(true);
 	});
 
-	it("should block approval when agent turn has empty expected behavior array", () => {
+	it("should allow approval when agent turn has empty expected behavior array (generic schema: not required)", () => {
 		const itemWithEmptyBehavior: GroundTruthItem = {
 			...baseItem,
 			history: [
@@ -69,7 +82,7 @@ describe("canApproveMultiTurn - Expected Behavior Validation", () => {
 		};
 
 		const result = canApproveMultiTurn(itemWithEmptyBehavior);
-		expect(result).toBe(false);
+		expect(result).toBe(true);
 	});
 
 	it("should allow approval when all multiple agent turns have expected behavior", () => {
@@ -101,7 +114,7 @@ describe("canApproveMultiTurn - Expected Behavior Validation", () => {
 		expect(result).toBe(true);
 	});
 
-	it("should block approval when any agent turn is missing expected behavior", () => {
+	it("should allow approval when any agent turn is missing expected behavior (generic schema: not required)", () => {
 		const multiTurnItem: GroundTruthItem = {
 			...baseItem,
 			history: [
@@ -121,13 +134,13 @@ describe("canApproveMultiTurn - Expected Behavior Validation", () => {
 				{
 					role: "agent",
 					content: "Here is how you extrude in a CAD application...",
-					// Missing expectedBehavior on second agent turn
+					// Missing expectedBehavior on second agent turn — no longer blocks
 				},
 			],
 		};
 
 		const result = canApproveMultiTurn(multiTurnItem);
-		expect(result).toBe(false);
+		expect(result).toBe(true);
 	});
 
 	it("should allow approval with single expected behavior", () => {
@@ -149,4 +162,235 @@ describe("canApproveMultiTurn - Expected Behavior Validation", () => {
 		const result = canApproveMultiTurn(itemWithSingleBehavior);
 		expect(result).toBe(true);
 	});
+
+	it("blocks multi-turn approval when required references are unvisited", () => {
+		const itemWithUnvisitedReferences: GroundTruthItem = {
+			...baseItem,
+			plugins: {
+				"rag-compat": {
+					kind: "rag-compat",
+					version: "1.0",
+					data: {
+						retrievals: {
+							_unassociated: {
+								candidates: [{ url: "https://example.com/trace" }],
+							},
+						},
+					},
+				},
+			},
+		};
+
+		const result = canApproveMultiTurn(itemWithUnvisitedReferences, {
+			requireReferenceVisit: true,
+			requireKeyParagraph: false,
+		});
+		expect(result).toBe(false);
+	});
+
+	it("blocks multi-turn approval when required key paragraphs are too short", () => {
+		const itemWithShortKeyParagraph: GroundTruthItem = {
+			...baseItem,
+			plugins: {
+				"rag-compat": {
+					kind: "rag-compat",
+					version: "1.0",
+					data: {
+						retrievals: {
+							_unassociated: {
+								candidates: [
+									{
+										url: "https://example.com/trace",
+										visitedAt: "2026-03-13T12:00:00Z",
+										keyParagraph: "Too short to satisfy the requirement.",
+									},
+								],
+							},
+						},
+					},
+				},
+			},
+		};
+
+		const result = canApproveMultiTurn(itemWithShortKeyParagraph, {
+			requireReferenceVisit: true,
+			requireKeyParagraph: true,
+		});
+		expect(result).toBe(false);
+	});
+
+	it("allows multi-turn approval when reference requirements are disabled", () => {
+		const itemWithUnvisitedReferences: GroundTruthItem = {
+			...baseItem,
+			plugins: {
+				"rag-compat": {
+					kind: "rag-compat",
+					version: "1.0",
+					data: {
+						retrievals: {
+							_unassociated: {
+								candidates: [{ url: "https://example.com/trace" }],
+							},
+						},
+					},
+				},
+			},
+		};
+
+		const result = canApproveMultiTurn(itemWithUnvisitedReferences, {
+			requireReferenceVisit: false,
+			requireKeyParagraph: false,
+		});
+		expect(result).toBe(true);
+	});
+
+	it("threads reference requirements through canApproveCandidate for multi-turn items", () => {
+		const itemWithUnvisitedReferences: GroundTruthItem = {
+			...baseItem,
+			plugins: {
+				"rag-compat": {
+					kind: "rag-compat",
+					version: "1.0",
+					data: {
+						retrievals: {
+							_unassociated: {
+								candidates: [{ url: "https://example.com/trace" }],
+							},
+						},
+					},
+				},
+			},
+		};
+
+		const result = canApproveCandidate(itemWithUnvisitedReferences, {
+			requireReferenceVisit: true,
+			requireKeyParagraph: false,
+		});
+		expect(result).toBe(false);
+	});
+
+	it("allows multi-turn approval when required references are visited and annotated", () => {
+		const itemWithReadyReferences: GroundTruthItem = {
+			...baseItem,
+			plugins: {
+				"rag-compat": {
+					kind: "rag-compat",
+					version: "1.0",
+					data: {
+						retrievals: {
+							_unassociated: {
+								candidates: [
+									{
+										url: "https://example.com/trace",
+										visitedAt: "2026-03-13T12:00:00Z",
+										keyParagraph:
+											"This key paragraph is comfortably longer than forty characters.",
+									},
+								],
+							},
+						},
+					},
+				},
+			},
+		};
+
+		const result = canApproveCandidate(itemWithReadyReferences, {
+			requireReferenceVisit: true,
+			requireKeyParagraph: true,
+		});
+		expect(result).toBe(true);
+	});
+});
+
+// ---------------------------------------------------------------------------
+// canApproveMultiTurn – expectedTools gating (Phase 4)
+// ---------------------------------------------------------------------------
+
+describe("canApproveMultiTurn - expectedTools gating", () => {
+	const localBaseItem: GroundTruthItem = {
+		id: "test-et",
+		providerId: "test",
+		question: "Test question",
+		answer: "Test answer",
+		status: "draft",
+	};
+	const validHistory: ConversationTurn[] = [
+		{ role: "user", content: "q" },
+		{ role: "agent", content: "a" },
+	];
+
+	it("blocks approval when no expectedTools are defined (≥1 required tool gate)", () => {
+		const item: GroundTruthItem = {
+			...localBaseItem,
+			history: validHistory,
+		};
+		expect(canApproveMultiTurn(item)).toBe(false);
+	});
+
+	it("approves when all required tools are present in toolCalls", () => {
+		const item: GroundTruthItem = {
+			...localBaseItem,
+			history: validHistory,
+			expectedTools: { required: [{ name: "search" }] },
+			toolCalls: [{ id: "1", name: "search", callType: "tool" }],
+		};
+		expect(canApproveMultiTurn(item)).toBe(true);
+	});
+
+	it("blocks approval when a required tool is missing from toolCalls", () => {
+		const item: GroundTruthItem = {
+			...localBaseItem,
+			history: validHistory,
+			expectedTools: { required: [{ name: "search" }] },
+			toolCalls: [],
+		};
+		expect(canApproveMultiTurn(item)).toBe(false);
+	});
+
+	it("blocks approval when only optional or notNeeded tools exist (no required)", () => {
+		const item: GroundTruthItem = {
+			...localBaseItem,
+			history: validHistory,
+			expectedTools: {
+				optional: [{ name: "summarize" }],
+				notNeeded: [{ name: "rerank" }],
+			},
+			toolCalls: [],
+		};
+		expect(canApproveMultiTurn(item)).toBe(false);
+	});
+
+	it("allows plugin bypass of required-tools gate", () => {
+		const item: GroundTruthItem = {
+			...localBaseItem,
+			history: validHistory,
+			plugins: {
+				"rag-compat": {
+					kind: "rag-compat",
+					version: "1",
+					data: { canBypassRequiredTools: true },
+				},
+			},
+		};
+		expect(canApproveMultiTurn(item)).toBe(true);
+	});
+
+	it("plugin bypass still validates required tools that are explicitly defined", () => {
+		const item: GroundTruthItem = {
+			...localBaseItem,
+			history: validHistory,
+			expectedTools: { required: [{ name: "search" }] },
+			toolCalls: [],
+			plugins: {
+				"rag-compat": {
+					kind: "rag-compat",
+					version: "1",
+					data: { canBypassRequiredTools: true },
+				},
+			},
+		};
+		// The bypass waives only the "≥1 required tool" gate; validateExpectedTools still
+		// blocks approval because the required "search" tool is missing from toolCalls.
+		expect(canApproveMultiTurn(item)).toBe(false);
+	});
 });
diff --git a/frontend/tests/unit/models/gtHelpers.referenceIdentity.test.ts b/frontend/tests/unit/models/gtHelpers.referenceIdentity.test.ts
new file mode 100644
index 0000000..2ed79bc
--- /dev/null
+++ b/frontend/tests/unit/models/gtHelpers.referenceIdentity.test.ts
@@ -0,0 +1,64 @@
+import { describe, expect, it } from "vitest";
+import type { Reference } from "../../../src/models/groundTruth";
+import { dedupeReferences } from "../../../src/models/gtHelpers";
+
+function makeReference(overrides: Partial<Reference> = {}): Reference {
+	return {
+		id: overrides.id ?? "ref-1",
+		url: overrides.url ?? "https://example.com/doc",
+		title: overrides.title,
+		snippet: overrides.snippet,
+		visitedAt: overrides.visitedAt ?? null,
+		keyParagraph: overrides.keyParagraph,
+		bonus: overrides.bonus ?? false,
+		messageIndex: overrides.messageIndex,
+		turnId: overrides.turnId ?? "turn-agent-1",
+		toolCallId: overrides.toolCallId ?? "tool-call-search",
+	};
+}
+
+describe("dedupeReferences chunk-aware identity", () => {
+	it("keeps same-url references distinct when snippet data differs within one tool/turn owner", () => {
+		const existing = [
+			makeReference({
+				id: "ref-existing",
+				snippet: "Chunk A",
+				keyParagraph: "Paragraph A",
+			}),
+		];
+		const chosen = [
+			makeReference({
+				id: "ref-chosen",
+				snippet: "Chunk B",
+				keyParagraph: "Paragraph B",
+			}),
+		];
+
+		const deduped = dedupeReferences(existing, chosen);
+
+		expect(deduped).toHaveLength(2);
+		expect(deduped.map((ref) => ref.snippet)).toEqual(["Chunk A", "Chunk B"]);
+	});
+
+	it("preserves legacy same-owner URL dedupe when chunk-level data is absent", () => {
+		const existing = [
+			makeReference({
+				id: "ref-existing",
+				snippet: undefined,
+				keyParagraph: undefined,
+			}),
+		];
+		const chosen = [
+			makeReference({
+				id: "ref-chosen",
+				snippet: undefined,
+				keyParagraph: undefined,
+			}),
+		];
+
+		const deduped = dedupeReferences(existing, chosen);
+
+		expect(deduped).toHaveLength(1);
+		expect(deduped[0]?.id).toBe("ref-existing");
+	});
+});
diff --git a/frontend/tests/unit/models/validators.test.ts b/frontend/tests/unit/models/validators.test.ts
index 8f0b3d1..56f4935 100644
--- a/frontend/tests/unit/models/validators.test.ts
+++ b/frontend/tests/unit/models/validators.test.ts
@@ -34,7 +34,7 @@ describe("validateConversationPattern", () => {
 		expect(result.errors).toHaveLength(0);
 	});
 
-	it("should reject conversation ending with user turn (incomplete)", () => {
+	it("should reject conversation ending with user turn", () => {
 		const history: ConversationTurn[] = [
 			{ role: "user", content: "What is the weather?" },
 			{ role: "agent", content: "It's sunny today." },
@@ -43,7 +43,7 @@ describe("validateConversationPattern", () => {
 		const result = validateConversationPattern(history);
 		expect(result.valid).toBe(false);
 		expect(result.errors).toContain(
-			"Conversation must end with an agent response (every user turn needs an agent response)",
+			"Conversation must end with an agent response",
 		);
 	});
 
@@ -61,49 +61,149 @@ describe("validateConversationPattern", () => {
 		expect(result.errors).toHaveLength(0);
 	});
 
-	it("should reject conversation with broken alternating pattern", () => {
+	it("should accept consecutive agent turns (agentic workflow)", () => {
 		const history: ConversationTurn[] = [
-			{ role: "user", content: "Question 1" },
-			{ role: "user", content: "Question 2" }, // Wrong - should be agent
-			{ role: "agent", content: "Answer" },
+			{ role: "user", content: "Why is my bill high?" },
+			{ role: "agent", content: "The usage spike came from streaming." },
+			{
+				role: "agent",
+				content: "Root cause: long streaming sessions on mobile data.",
+			},
 		];
 		const result = validateConversationPattern(history);
-		expect(result.valid).toBe(false);
-		expect(result.errors.length).toBeGreaterThan(0);
-		expect(
-			result.errors.some((e) => e.includes("Turn 2 should be a agent turn")),
-		).toBe(true);
+		expect(result.valid).toBe(true);
+		expect(result.errors).toHaveLength(0);
 	});
 
-	it("should reject conversation with consecutive agent turns", () => {
+	it("should accept multiple agent roles in sequence", () => {
 		const history: ConversationTurn[] = [
-			{ role: "user", content: "Question" },
-			{ role: "agent", content: "Answer 1" },
-			{ role: "agent", content: "Answer 2" }, // Wrong - should be user
+			{ role: "user", content: "Diagnose network issue" },
+			{ role: "orchestrator-agent", content: "Routing to diagnostics..." },
+			{ role: "output-agent", content: "Signal strength is low in your area." },
+			{ role: "agent", content: "Summary: tower congestion detected." },
 		];
 		const result = validateConversationPattern(history);
-		expect(result.valid).toBe(false);
-		expect(result.errors.length).toBeGreaterThan(0);
+		expect(result.valid).toBe(true);
+		expect(result.errors).toHaveLength(0);
 	});
 
-	it("should provide multiple errors when multiple violations exist", () => {
+	it("should reject conversation starting with agent even if multiple agents follow", () => {
 		const history: ConversationTurn[] = [
-			{ role: "agent", content: "Starting with agent" }, // Error 1: doesn't start with user
-			{ role: "agent", content: "Another agent" }, // Error 2: wrong pattern
+			{ role: "agent", content: "Starting with agent" },
+			{ role: "agent", content: "Another agent" },
 		];
 		const result = validateConversationPattern(history);
 		expect(result.valid).toBe(false);
-		expect(result.errors.length).toBeGreaterThan(1);
+		expect(result.errors).toContain("Conversation must start with a user turn");
 	});
 
-	it("should handle conversation with only one user turn (incomplete)", () => {
+	it("should reject consecutive user turns that end with user", () => {
 		const history: ConversationTurn[] = [
-			{ role: "user", content: "Just a question" },
+			{ role: "user", content: "Question 1" },
+			{ role: "user", content: "Question 2" },
 		];
 		const result = validateConversationPattern(history);
 		expect(result.valid).toBe(false);
 		expect(result.errors).toContain(
-			"Conversation must end with an agent response (every user turn needs an agent response)",
+			"Conversation must end with an agent response",
 		);
 	});
+
+	it("should accept consecutive user turns followed by agent", () => {
+		const history: ConversationTurn[] = [
+			{ role: "user", content: "Question 1" },
+			{ role: "user", content: "Wait, let me rephrase" },
+			{ role: "agent", content: "Here is my answer." },
+		];
+		const result = validateConversationPattern(history);
+		expect(result.valid).toBe(true);
+		expect(result.errors).toHaveLength(0);
+	});
+});
+
+// ---------------------------------------------------------------------------
+// validateExpectedTools
+// ---------------------------------------------------------------------------
+import type { GroundTruthItem } from "../../../src/models/groundTruth";
+import { validateExpectedTools } from "../../../src/models/validators";
+
+const baseItem: GroundTruthItem = {
+	id: "t1",
+	providerId: "test",
+	history: [
+		{ role: "user", content: "q", turnId: "turn_1" },
+		{ role: "agent", content: "a", turnId: "turn_2" },
+	],
+	status: "draft",
+};
+
+describe("validateExpectedTools", () => {
+	it("returns valid when no expectedTools defined", () => {
+		const result = validateExpectedTools({ ...baseItem });
+		expect(result.valid).toBe(true);
+		expect(result.missingRequired).toHaveLength(0);
+	});
+
+	it("returns valid when expectedTools.required is empty", () => {
+		const result = validateExpectedTools({
+			...baseItem,
+			expectedTools: { required: [] },
+		});
+		expect(result.valid).toBe(true);
+	});
+
+	it("returns valid when all required tools are in toolCalls", () => {
+		const result = validateExpectedTools({
+			...baseItem,
+			expectedTools: {
+				required: [{ name: "search" }, { name: "lookup" }],
+			},
+			toolCalls: [
+				{ id: "1", name: "search", callType: "tool" },
+				{ id: "2", name: "lookup", callType: "tool" },
+			],
+		});
+		expect(result.valid).toBe(true);
+		expect(result.missingRequired).toHaveLength(0);
+	});
+
+	it("returns invalid with missingRequired when a required tool is absent", () => {
+		const result = validateExpectedTools({
+			...baseItem,
+			expectedTools: {
+				required: [{ name: "search" }, { name: "lookup" }],
+			},
+			toolCalls: [{ id: "1", name: "search", callType: "tool" }],
+		});
+		expect(result.valid).toBe(false);
+		expect(result.missingRequired).toEqual(["lookup"]);
+		expect(result.errors).toHaveLength(1);
+		expect(result.errors[0]).toContain("lookup");
+	});
+
+	it("ignores optional and notNeeded in validation", () => {
+		const result = validateExpectedTools({
+			...baseItem,
+			expectedTools: {
+				required: [{ name: "search" }],
+				optional: [{ name: "summarize" }],
+				notNeeded: [{ name: "rerank" }],
+			},
+			toolCalls: [{ id: "1", name: "search", callType: "tool" }],
+		});
+		expect(result.valid).toBe(true);
+	});
+
+	it("returns multiple missing tools in errors array", () => {
+		const result = validateExpectedTools({
+			...baseItem,
+			expectedTools: {
+				required: [{ name: "toolA" }, { name: "toolB" }, { name: "toolC" }],
+			},
+			toolCalls: [],
+		});
+		expect(result.valid).toBe(false);
+		expect(result.missingRequired).toHaveLength(3);
+		expect(result.errors).toHaveLength(3);
+	});
 });
diff --git a/frontend/tests/unit/provider/duplicate-json.test.ts b/frontend/tests/unit/provider/duplicate-json.test.ts
index 8a458de..2b93f7a 100644
--- a/frontend/tests/unit/provider/duplicate-json.test.ts
+++ b/frontend/tests/unit/provider/duplicate-json.test.ts
@@ -1,5 +1,6 @@
 import { describe, expect, it } from "vitest";
 import { DEMO_JSON } from "../../../src/models/demoData";
+import { getItemReferences } from "../../../src/models/groundTruth";
 import { JsonProvider } from "../../../src/models/provider";
 
 describe("JsonProvider duplicate", () => {
@@ -16,7 +17,9 @@ describe("JsonProvider duplicate", () => {
 		// Core fields copied
 		expect(created.question).toBe(original.question);
 		expect(created.answer).toBe(original.answer);
-		expect(created.references?.length).toBe(original.references?.length);
+		expect(getItemReferences(created).length).toBe(
+			getItemReferences(original).length,
+		);
 		// Draft status and not deleted
 		expect(created.status).toBe("draft");
 		expect(created.deleted).toBe(false);
diff --git a/frontend/tests/unit/provider/provider.multiturn.test.ts b/frontend/tests/unit/provider/provider.multiturn.test.ts
index a905051..27687d4 100644
--- a/frontend/tests/unit/provider/provider.multiturn.test.ts
+++ b/frontend/tests/unit/provider/provider.multiturn.test.ts
@@ -5,6 +5,10 @@ import type {
 	GroundTruthItem,
 	Reference,
 } from "../../../src/models/groundTruth";
+import {
+	getItemReferences,
+	withUpdatedReferences,
+} from "../../../src/models/groundTruth";
 
 const {
 	mockGetMyAssignments,
@@ -29,7 +33,23 @@ vi.mock("../../../src/services/groundTruths", () => ({
 	getGroundTruth: mockGetGroundTruth,
 }));
 
-type ApiItem = components["schemas"]["GroundTruthItem-Output"];
+type ApiHistoryEntry = components["schemas"]["HistoryEntry"] & {
+	refs?: components["schemas"]["Reference"][];
+	expectedBehavior?: string[];
+};
+type ApiItem = Omit<
+	components["schemas"]["AgenticGroundTruthEntry-Output"],
+	"history"
+> & {
+	synthQuestion?: string | null;
+	editedQuestion?: string | null;
+	answer?: string | null;
+	refs?: components["schemas"]["Reference"][];
+	totalReferences?: number;
+	tags?: string[];
+	comment?: string | null;
+	history?: ApiHistoryEntry[];
+};
 
 type Patch = Partial<ApiItem>;
 
@@ -58,480 +78,279 @@ beforeEach(() => {
 	mockGetGroundTruth.mockReset();
 });
 
-describe("ApiProvider multi-turn mapping", () => {
-	it("list maps history roles and content", async () => {
-		const apiItem = makeApiItem({
-			history: [
-				{ role: "user", msg: "How do I?" },
-				{ role: "assistant", msg: "Use the regenerate command." },
-			],
-		});
-		mockGetMyAssignments.mockResolvedValue([apiItem]);
-		const provider = new ApiProvider();
-		const { items } = await provider.list();
-		expect(items).toHaveLength(1);
-		const history = items[0].history ?? [];
-		expect(history[0]).toMatchObject({ role: "user", content: "How do I?" });
-		expect(history[1]).toMatchObject({
-			role: "agent",
-			content: "Use the regenerate command.",
-		});
-	});
-
-	it("list maps per-turn refs with messageIndex", async () => {
-		const apiItem = makeApiItem({
-			history: [
-				{ role: "user", msg: "Q" },
-				{
-					role: "assistant",
-					msg: "A",
-					refs: [
-						{
-							url: "https://turn.ref",
-							content: "Snippet",
-							keyExcerpt: "Key",
-							bonus: true,
-						},
-					],
-				},
-			],
+describe("ApiProvider mapping", () => {
+	describe("core-generic multi-turn contracts", () => {
+		it("maps history roles, content, and stable turn identity", async () => {
+			const apiItem = makeApiItem({
+				history: [
+					{ role: "user", msg: "How do I?" },
+					{ role: "assistant", msg: "Use the regenerate command." },
+				],
+			});
+			mockGetMyAssignments.mockResolvedValue([apiItem]);
+			const provider = new ApiProvider();
+			const { items } = await provider.list();
+			const history = items[0].history ?? [];
+			expect(history).toHaveLength(2);
+			expect(history[0]).toMatchObject({ role: "user", content: "How do I?" });
+			expect(history[1]).toMatchObject({
+				role: "agent",
+				content: "Use the regenerate command.",
+			});
+			expect(history[0]?.turnId).toBeTruthy();
+			expect(history[1]?.turnId).toBeTruthy();
 		});
-		mockGetMyAssignments.mockResolvedValue([apiItem]);
-		const provider = new ApiProvider();
-		const { items } = await provider.list();
-		const refs = items[0].references;
-		const turnRef = refs.find((r) => r.messageIndex === 1);
-		expect(turnRef).toBeTruthy();
-		expect(turnRef?.url).toBe("https://turn.ref");
-		expect(turnRef?.bonus).toBe(true);
-	});
 
-	it("list includes top-level refs with messageIndex=1 for legacy items (Bug Fix: SA-86)", async () => {
-		// Legacy single-turn items have no history (or empty history array)
-		// Top-level refs should be assigned to the agent turn (messageIndex = 1)
-		const apiItem = makeApiItem({
-			refs: [
-				{
-					url: "https://top.ref",
-					content: "Top snippet",
-					keyExcerpt: "Top key",
-					bonus: false,
-				},
-			],
+		it("maps per-turn refs onto the owning non-user turn", async () => {
+			const apiItem = makeApiItem({
+				history: [
+					{ role: "user", msg: "Q" },
+					{
+						role: "assistant",
+						msg: "A",
+						refs: [
+							{
+								url: "https://turn.ref",
+								content: "Snippet",
+								keyExcerpt: "Key",
+								bonus: true,
+							},
+						],
+					},
+				],
+			});
+			mockGetMyAssignments.mockResolvedValue([apiItem]);
+			const provider = new ApiProvider();
+			const { items } = await provider.list();
+			const turn = items[0].history?.[1];
+			const [ref] = getItemReferences(items[0]);
+			expect(ref).toMatchObject({
+				url: "https://turn.ref",
+				bonus: true,
+				messageIndex: 1,
+				turnId: turn?.turnId,
+			});
 		});
-		mockGetMyAssignments.mockResolvedValue([apiItem]);
-		const provider = new ApiProvider();
-		const { items } = await provider.list();
-		const topRef = items[0].references.find(
-			(r: Reference) => r.url === "https://top.ref",
-		);
-		expect(topRef).toBeTruthy();
-		// Legacy items: refs assigned to agent turn
-		expect(topRef?.messageIndex).toBe(1);
 	});
 
-	it("list does not apply top-level tags to user turn when converting single-turn", async () => {
-		const apiItem = makeApiItem({
-			synthQuestion: "What is X?",
-			editedQuestion: "What is X exactly?",
-			answer: "X is Y",
-			tags: ["important", "technical"],
-			history: undefined, // No history = single-turn item
-		});
-		mockGetMyAssignments.mockResolvedValue([apiItem]);
-		const provider = new ApiProvider();
-		const { items } = await provider.list();
-		expect(items).toHaveLength(1);
-		const history = items[0].history ?? [];
-		expect(history).toHaveLength(2);
-		expect(history[0]).toMatchObject({
-			role: "user",
-			content: "What is X exactly?",
+	describe("compat-migration read projections", () => {
+		it("projects legacy single-turn payloads into stable user and agent turns", async () => {
+			const apiItem = makeApiItem({
+				synthQuestion: "What is X?",
+				editedQuestion: "What is X exactly?",
+				answer: "X is Y",
+				tags: ["important", "technical"],
+				history: undefined,
+			});
+			mockGetMyAssignments.mockResolvedValue([apiItem]);
+			const provider = new ApiProvider();
+			const { items } = await provider.list();
+			const history = items[0].history ?? [];
+			expect(history).toHaveLength(2);
+			expect(history[0]).toMatchObject({
+				role: "user",
+				content: "What is X exactly?",
+			});
+			expect(history[1]).toMatchObject({
+				role: "agent",
+				content: "X is Y",
+			});
 		});
-		expect(history[1]).toMatchObject({
-			role: "agent",
-			content: "X is Y",
-		});
-	});
 
-	it("list assigns refs to messageIndex 1 when converting single-turn with answer", async () => {
-		const apiItem = makeApiItem({
-			synthQuestion: "Question",
-			answer: "Answer",
-			refs: [
-				{
-					url: "https://example.com",
-					content: "content",
-					keyExcerpt: "key",
-					bonus: false,
-				},
-			],
-			history: undefined,
+		it("anchors legacy top-level refs to the synthesized agent turn even without an answer", async () => {
+			const apiItem = makeApiItem({
+				editedQuestion: "How do I configure authentication for my app?",
+				answer: "",
+				refs: [
+					{
+						url: "https://docs.example.com/auth",
+						content: "Authentication documentation content",
+						keyExcerpt: "Use OAuth 2.0 for authentication",
+						bonus: false,
+					},
+				],
+				history: undefined,
+			});
+			mockGetMyAssignments.mockResolvedValue([apiItem]);
+			const provider = new ApiProvider();
+			const { items } = await provider.list();
+			const history = items[0].history ?? [];
+			const [ref] = getItemReferences(items[0]);
+			expect(history).toHaveLength(2);
+			expect(history[0]?.content).toBe(
+				"How do I configure authentication for my app?",
+			);
+			expect(history[1]).toMatchObject({ role: "agent", content: "" });
+			expect(ref).toMatchObject({
+				url: "https://docs.example.com/auth",
+				messageIndex: 1,
+				turnId: history[1]?.turnId,
+			});
 		});
-		mockGetMyAssignments.mockResolvedValue([apiItem]);
-		const provider = new ApiProvider();
-		const { items } = await provider.list();
-		expect(items[0].references).toHaveLength(1);
-		expect(items[0].references[0].messageIndex).toBe(1);
 	});
+});
 
-	it("list assigns refs to messageIndex 1 even when answer is missing (Bug Fix: SA-86)", async () => {
-		const apiItem = makeApiItem({
-			synthQuestion: "Question",
-			answer: "",
-			refs: [
-				{
-					url: "https://example.com",
-					content: "content",
-					keyExcerpt: "key",
-					bonus: false,
+describe("ApiProvider serialization", () => {
+	describe("core-generic multi-turn writes", () => {
+		it("serializes history roles and keeps refs scoped to non-user turns", async () => {
+			const apiItem = makeApiItem({
+				history: [
+					{ role: "user", msg: "Original Q" },
+					{ role: "assistant", msg: "Original A" },
+				],
+			});
+			let capturedPatch: Patch | undefined;
+			mockUpdateAssignedGroundTruth.mockImplementation(
+				async (_dataset: string, _bucket: string, id: string, patch: Patch) => {
+					capturedPatch = patch;
+					return {
+						...apiItem,
+						id,
+						history: (patch.history as ApiItem["history"]) ?? apiItem.history,
+						refs: (patch.refs as ApiItem["refs"]) ?? apiItem.refs,
+						answer: (patch.answer as string) ?? apiItem.answer,
+						editedQuestion:
+							(patch.editedQuestion as string) ?? apiItem.editedQuestion,
+						status: (patch.status as ApiItem["status"]) ?? apiItem.status,
+					} as ApiItem;
 				},
-			],
-			history: undefined,
-		});
-		mockGetMyAssignments.mockResolvedValue([apiItem]);
-		const provider = new ApiProvider();
-		const { items } = await provider.list();
-		expect(items[0].references).toHaveLength(1);
-		expect(items[0].references[0].messageIndex).toBe(1);
-	});
-
-	it("list creates empty agent turn when question exists but answer is missing (Bug Fix: SA-86)", async () => {
-		const apiItem = makeApiItem({
-			synthQuestion: "Question without answer",
-			editedQuestion: undefined, // No edited question
-			answer: "",
-			history: undefined,
-		});
-		mockGetMyAssignments.mockResolvedValue([apiItem]);
-		const provider = new ApiProvider();
-		const { items } = await provider.list();
-		expect(items[0].history).toHaveLength(2);
-		expect(items[0].history?.[0]).toMatchObject({
-			role: "user",
-			content: "Question without answer",
-		});
-		expect(items[0].history?.[1]).toMatchObject({
-			role: "agent",
-			content: "",
-		});
-	});
-
-	it("list creates empty agent turn for null answer (Bug Fix: SA-86)", async () => {
-		const apiItem = makeApiItem({
-			synthQuestion: "Question",
-			answer: null as unknown as string,
-			history: undefined,
-		});
-		mockGetMyAssignments.mockResolvedValue([apiItem]);
-		const provider = new ApiProvider();
-		const { items } = await provider.list();
-		expect(items[0].history).toHaveLength(2);
-		expect(items[0].history?.[1]).toMatchObject({
-			role: "agent",
-			content: "",
-		});
-	});
-
-	it("list assigns refs to messageIndex 1 for question with refs but no answer (Bug Fix: SA-86)", async () => {
-		const apiItem = makeApiItem({
-			editedQuestion: "How do I configure authentication for my app?",
-			answer: "",
-			refs: [
+			);
+			mockGetMyAssignments.mockResolvedValue([apiItem]);
+			const provider = new ApiProvider();
+			const { items } = await provider.list();
+			const domain = items[0];
+			const history: NonNullable<GroundTruthItem["history"]> = [
+				{ role: "user", content: "Updated Q", turnId: "turn-user-updated" },
+				{ role: "agent", content: "Updated A", turnId: "turn-agent-updated" },
+			];
+			const refs: Reference[] = [
+				{ id: "global", url: "https://global" },
 				{
-					url: "https://docs.example.com/auth",
-					content: "Authentication documentation content",
-					keyExcerpt: "Use OAuth 2.0 for authentication",
-					bonus: false,
+					id: "turn",
+					url: "https://turn",
+					turnId: "turn-agent-updated",
 				},
 				{
-					url: "https://docs.example.com/config",
-					content: "Configuration guide",
-					bonus: false,
+					id: "user-ref",
+					url: "https://user",
+					turnId: "turn-user-updated",
 				},
-			],
-			history: undefined,
+			];
+			const updated: GroundTruthItem = withUpdatedReferences(
+				{ ...domain, history },
+				refs,
+			);
+			await provider.save(updated);
+			expect(capturedPatch).toBeDefined();
+			const patch = capturedPatch as Patch;
+			const patchHistory = patch.history as ApiItem["history"];
+			expect(patchHistory?.[0]?.role).toBe("user");
+			expect(patchHistory?.[1]?.role).toBe("assistant");
+			expect(patchHistory?.[0]?.refs).toBeUndefined();
+			expect(patchHistory?.[1]?.refs).toHaveLength(1);
+			expect(patchHistory?.[1]?.refs?.[0]?.url).toBe("https://turn");
 		});
-		mockGetMyAssignments.mockResolvedValue([apiItem]);
-		const provider = new ApiProvider();
-		const { items } = await provider.list();
-		expect(items[0].history).toHaveLength(2);
-		expect(items[0].history?.[0].content).toBe(
-			"How do I configure authentication for my app?",
-		);
-		expect(items[0].history?.[1].content).toBe("");
-		expect(items[0].references).toHaveLength(2);
-		expect(items[0].references[0].messageIndex).toBe(1);
-		expect(items[0].references[1].messageIndex).toBe(1);
-	});
-});
-
-describe("ApiProvider multi-turn serialization", () => {
-	it("save serializes history roles and agent refs", async () => {
-		const apiItem = makeApiItem({
-			history: [
-				{ role: "user", msg: "Original Q" },
-				{ role: "assistant", msg: "Original A" },
-			],
-		});
-		let capturedPatch: Patch | undefined;
-		mockUpdateAssignedGroundTruth.mockImplementation(
-			async (_dataset: string, _bucket: string, id: string, patch: Patch) => {
-				capturedPatch = patch;
-				return {
-					...apiItem,
-					id,
-					history: (patch.history as ApiItem["history"]) ?? apiItem.history,
-					refs: (patch.refs as ApiItem["refs"]) ?? apiItem.refs,
-					answer: (patch.answer as string) ?? apiItem.answer,
-					editedQuestion:
-						(patch.editedQuestion as string) ?? apiItem.editedQuestion,
-					status: (patch.status as ApiItem["status"]) ?? apiItem.status,
-				} as ApiItem;
-			},
-		);
-		mockGetMyAssignments.mockResolvedValue([apiItem]);
-		const provider = new ApiProvider();
-		const { items } = await provider.list();
-		const domain = items[0];
-		const history: NonNullable<GroundTruthItem["history"]> = [
-			{ role: "user", content: "Updated Q" },
-			{ role: "agent", content: "Updated A" },
-		];
-		const refs: Reference[] = [
-			{ id: "global", url: "https://global" },
-			{
-				id: "turn",
-				url: "https://turn",
-				messageIndex: 1,
-			},
-		];
-		const updated: GroundTruthItem = {
-			...domain,
-			history,
-			references: refs,
-		};
-		await provider.save(updated);
-		expect(capturedPatch).toBeDefined();
-		const patch = capturedPatch as Patch;
-		const patchHistory = patch.history as ApiItem["history"];
-		expect(patchHistory?.[0]?.role).toBe("user");
-		expect(patchHistory?.[1]?.role).toBe("assistant");
-		const agentRefs = patchHistory?.[1]?.refs;
-		expect(agentRefs).toHaveLength(1);
-		expect(agentRefs?.[0]?.url).toBe("https://turn");
-		const userRefs = patchHistory?.[0]?.refs;
-		expect(userRefs).toBeUndefined();
-	});
-
-	it("save omits agent refs when none provided", async () => {
-		const apiItem = makeApiItem({
-			history: [
-				{ role: "user", msg: "Original Q" },
-				{ role: "assistant", msg: "Original A" },
-			],
-		});
-		let capturedPatch: Patch | undefined;
-		mockUpdateAssignedGroundTruth.mockImplementation(
-			async (_dataset: string, _bucket: string, _id: string, patch: Patch) => {
-				capturedPatch = patch;
-				return apiItem;
-			},
-		);
-		mockGetMyAssignments.mockResolvedValue([apiItem]);
-		const provider = new ApiProvider();
-		const { items } = await provider.list();
-		const domain = items[0];
-		const history: NonNullable<GroundTruthItem["history"]> = [
-			{ role: "user", content: "Updated Q" },
-			{ role: "agent", content: "Updated A" },
-		];
-		const refs: Reference[] = [{ id: "global", url: "https://global" }];
-		const updated: GroundTruthItem = {
-			...domain,
-			history,
-			references: refs,
-		};
-		await provider.save(updated);
-		expect(capturedPatch).toBeDefined();
-		const patch = capturedPatch as Patch;
-		const patchHistory = patch.history as ApiItem["history"];
-		expect(patchHistory?.[1]?.refs).toBeUndefined();
-	});
-
-	it("save excludes refs for user turns even if provided", async () => {
-		const apiItem = makeApiItem({
-			history: [
-				{ role: "user", msg: "Original Q" },
-				{ role: "assistant", msg: "Original A" },
-			],
-		});
-		let capturedPatch: Patch | undefined;
-		mockUpdateAssignedGroundTruth.mockImplementation(
-			async (_dataset: string, _bucket: string, id: string, patch: Patch) => {
-				capturedPatch = patch;
-				void id;
-				return apiItem;
-			},
-		);
-		mockGetMyAssignments.mockResolvedValue([apiItem]);
-		const provider = new ApiProvider();
-		const { items } = await provider.list();
-		const domain = items[0];
-		const history: NonNullable<GroundTruthItem["history"]> = [
-			{ role: "user", content: "Updated Q" },
-			{ role: "agent", content: "Updated A" },
-		];
-		const refs: Reference[] = [
-			{
-				id: "user-ref",
-				url: "https://user",
-				messageIndex: 0,
-			},
-			{
-				id: "agent-ref",
-				url: "https://agent",
-				messageIndex: 1,
-			},
-		];
-		const updated: GroundTruthItem = {
-			...domain,
-			history,
-			references: refs,
-		};
-		await provider.save(updated);
-		expect(capturedPatch).toBeDefined();
-		const patch = capturedPatch as Patch;
-		const patchHistory = patch.history as ApiItem["history"];
-		expect(patchHistory?.[0]?.refs).toBeUndefined();
-		expect(patchHistory?.[1]?.refs).toHaveLength(1);
-		expect(patchHistory?.[1]?.refs?.[0]?.url).toBe("https://agent");
-	});
 
-	it("save preserves top-level refs for legacy single-turn items (SA-86 bug fix)", async () => {
-		// Regression test for bug where top-level refs were wiped on save
-		// Legacy single-turn items have refs at top-level (no history).
-		// When loaded, fromApi() converts them to multi-turn and assigns messageIndex=1.
-		// When saved, toPatch() must save them back to top-level to prevent data loss.
-		const apiItem = makeApiItem({
-			synthQuestion: "What is X?",
-			answer: "X is Y",
-			refs: [
-				{
-					url: "https://legacy.ref/doc1",
-					content: "Legacy content",
-					keyExcerpt: "Key paragraph",
-					bonus: false,
-				},
-				{
-					url: "https://legacy.ref/doc2",
-					content: "Bonus content",
-					bonus: true,
+		it("keeps true multi-turn refs out of top-level compatibility fields", async () => {
+			const apiItem = makeApiItem({
+				history: [
+					{ role: "user", msg: "Question" },
+					{
+						role: "assistant",
+						msg: "Answer",
+						refs: [
+							{
+								url: "https://turn.ref",
+								content: "Turn content",
+								bonus: false,
+							},
+						],
+					},
+				],
+				refs: [],
+			});
+			let capturedPatch: Patch | undefined;
+			mockUpdateAssignedGroundTruth.mockImplementation(
+				async (
+					_dataset: string,
+					_bucket: string,
+					_id: string,
+					patch: Patch,
+				) => {
+					capturedPatch = patch;
+					return apiItem;
 				},
-			],
-			history: undefined, // No history = legacy single-turn item
+			);
+			mockGetMyAssignments.mockResolvedValue([apiItem]);
+			const provider = new ApiProvider();
+			const { items } = await provider.list();
+			await provider.save(items[0]);
+			const patch = capturedPatch as Patch;
+			const patchHistory = patch.history as ApiItem["history"];
+			expect(patch.refs).toHaveLength(0);
+			expect(patchHistory?.[1]?.refs).toHaveLength(1);
+			expect(patchHistory?.[1]?.refs?.[0]?.url).toBe("https://turn.ref");
 		});
-
-		let capturedPatch: Patch | undefined;
-		mockUpdateAssignedGroundTruth.mockImplementation(
-			async (_dataset: string, _bucket: string, id: string, patch: Patch) => {
-				capturedPatch = patch;
-				return {
-					...apiItem,
-					id,
-					refs: (patch.refs as ApiItem["refs"]) ?? apiItem.refs,
-					status: (patch.status as ApiItem["status"]) ?? apiItem.status,
-				} as ApiItem;
-			},
-		);
-
-		mockGetMyAssignments.mockResolvedValue([apiItem]);
-		const provider = new ApiProvider();
-		const { items } = await provider.list();
-
-		// Verify fromApi() assigned messageIndex=1 to legacy refs
-		expect(items[0].references).toHaveLength(2);
-		expect(items[0].references[0].messageIndex).toBe(1);
-		expect(items[0].references[1].messageIndex).toBe(1);
-
-		// User might edit the item (e.g., change bonus flag, add key paragraph)
-		const updated: GroundTruthItem = {
-			...items[0],
-			references: items[0].references.map((r: Reference) =>
-				r.url === "https://legacy.ref/doc1"
-					? { ...r, bonus: true, keyParagraph: "Updated key" }
-					: r,
-			),
-		};
-
-		// Save the item
-		await provider.save(updated);
-
-		// Verify the patch preserves top-level refs (not wiped out)
-		expect(capturedPatch).toBeDefined();
-		const patch = capturedPatch as Patch;
-		expect(patch.refs).toBeDefined();
-		expect(patch.refs).toHaveLength(2);
-
-		// Verify refs are saved to top-level (backward compatible)
-		expect(patch.refs?.[0]?.url).toBe("https://legacy.ref/doc1");
-		expect(patch.refs?.[0]?.bonus).toBe(true); // User edit preserved
-		expect(patch.refs?.[0]?.keyExcerpt).toBe("Updated key"); // User edit preserved
-		expect(patch.refs?.[1]?.url).toBe("https://legacy.ref/doc2");
-		expect(patch.refs?.[1]?.bonus).toBe(true);
-
-		// Verify refs are ALSO in history[1] for multi-turn compatibility
-		const patchHistory = patch.history as ApiItem["history"];
-		expect(patchHistory).toBeDefined();
-		expect(patchHistory?.[1]?.refs).toBeDefined();
-		expect(patchHistory?.[1]?.refs).toHaveLength(2);
 	});
 
-	it("save does not include turn refs in top-level for true multi-turn items", async () => {
-		// True multi-turn items (created with history) should NOT have turn refs
-		// saved to top-level, only to history[].refs
-		const apiItem = makeApiItem({
-			history: [
-				{ role: "user", msg: "Question" },
-				{
-					role: "assistant",
-					msg: "Answer",
-					refs: [
-						{
-							url: "https://turn.ref",
-							content: "Turn content",
-							bonus: false,
-						},
-					],
+	describe("compat-migration write projections", () => {
+		it("preserves legacy top-level refs when saving a synthesized single-turn item", async () => {
+			const apiItem = makeApiItem({
+				synthQuestion: "What is X?",
+				answer: "X is Y",
+				refs: [
+					{
+						url: "https://legacy.ref/doc1",
+						content: "Legacy content",
+						keyExcerpt: "Key paragraph",
+						bonus: false,
+					},
+					{
+						url: "https://legacy.ref/doc2",
+						content: "Bonus content",
+						bonus: true,
+					},
+				],
+				history: undefined,
+			});
+			let capturedPatch: Patch | undefined;
+			mockUpdateAssignedGroundTruth.mockImplementation(
+				async (_dataset: string, _bucket: string, id: string, patch: Patch) => {
+					capturedPatch = patch;
+					return {
+						...apiItem,
+						id,
+						refs: (patch.refs as ApiItem["refs"]) ?? apiItem.refs,
+						status: (patch.status as ApiItem["status"]) ?? apiItem.status,
+					} as ApiItem;
 				},
-			],
-			refs: [], // No top-level refs
+			);
+			mockGetMyAssignments.mockResolvedValue([apiItem]);
+			const provider = new ApiProvider();
+			const { items } = await provider.list();
+			const legacyRefs = getItemReferences(items[0]);
+			const updated: GroundTruthItem = withUpdatedReferences(
+				items[0],
+				legacyRefs.map((ref) =>
+					ref.url === "https://legacy.ref/doc1"
+						? { ...ref, bonus: true, keyParagraph: "Updated key" }
+						: ref,
+				),
+			);
+			await provider.save(updated);
+			const patch = capturedPatch as Patch;
+			const patchHistory = patch.history as ApiItem["history"];
+			expect(patch.refs).toHaveLength(2);
+			expect(patch.refs?.[0]).toMatchObject({
+				url: "https://legacy.ref/doc1",
+				bonus: true,
+				keyExcerpt: "Updated key",
+			});
+			expect(patch.refs?.[1]).toMatchObject({
+				url: "https://legacy.ref/doc2",
+				bonus: true,
+			});
+			expect(patchHistory?.[1]?.refs).toHaveLength(2);
 		});
-
-		let capturedPatch: Patch | undefined;
-		mockUpdateAssignedGroundTruth.mockImplementation(
-			async (_dataset: string, _bucket: string, _id: string, patch: Patch) => {
-				capturedPatch = patch;
-				return apiItem;
-			},
-		);
-
-		mockGetMyAssignments.mockResolvedValue([apiItem]);
-		const provider = new ApiProvider();
-		const { items } = await provider.list();
-
-		// Save the item unchanged
-		await provider.save(items[0]);
-
-		expect(capturedPatch).toBeDefined();
-		const patch = capturedPatch as Patch;
-
-		// Top-level refs should be empty
-		expect(patch.refs).toHaveLength(0);
-
-		// Refs should be in history[1]
-		const patchHistory = patch.history as ApiItem["history"];
-		expect(patchHistory?.[1]?.refs).toHaveLength(1);
-		expect(patchHistory?.[1]?.refs?.[0]?.url).toBe("https://turn.ref");
 	});
 });
diff --git a/frontend/tests/unit/registry/FieldComponentRegistry.test.tsx b/frontend/tests/unit/registry/FieldComponentRegistry.test.tsx
new file mode 100644
index 0000000..606e506
--- /dev/null
+++ b/frontend/tests/unit/registry/FieldComponentRegistry.test.tsx
@@ -0,0 +1,195 @@
+import { beforeEach, describe, expect, it, vi } from "vitest";
+import type { ToolCallRecord } from "../../../src/models/groundTruth";
+import {
+	ToolCallExtensions,
+	toolCallDiscriminator,
+} from "../../../src/registry/FieldComponentRegistry";
+import type {
+	ToolCallActionProps,
+	ToolCallExtensionRegistration,
+} from "../../../src/registry/types";
+
+// ---------------------------------------------------------------------------
+// Helpers
+// ---------------------------------------------------------------------------
+
+function StubAction(_props: ToolCallActionProps) {
+	return <div>stub</div>;
+}
+
+function makeTc(
+	name: string,
+	overrides?: Partial<ToolCallRecord>,
+): ToolCallRecord {
+	return { id: "tc-1", name, callType: "tool", ...overrides };
+}
+
+// ---------------------------------------------------------------------------
+// Tests
+// ---------------------------------------------------------------------------
+
+describe("ToolCallExtensions", () => {
+	let registry: ToolCallExtensions;
+
+	beforeEach(() => {
+		registry = new ToolCallExtensions();
+	});
+
+	// ── register / resolveAll ───────────────────────────────────────────────
+
+	it("registers and resolves by exact discriminator", () => {
+		registry.register({
+			discriminator: "toolCall:search",
+			component: StubAction,
+			displayName: "Search",
+		});
+
+		const matches = registry.resolveAll(makeTc("search"));
+		expect(matches).toHaveLength(1);
+		expect(matches[0].component).toBe(StubAction);
+	});
+
+	// ── prefix matching ─────────────────────────────────────────────────────
+
+	it("resolves via prefix fallback (toolCall matches toolCall:retrieval)", () => {
+		registry.register({
+			discriminator: "toolCall",
+			component: StubAction,
+			displayName: "Catch-all",
+		});
+
+		const matches = registry.resolveAll(makeTc("retrieval"));
+		expect(matches).toHaveLength(1);
+		expect(matches[0].displayName).toBe("Catch-all");
+	});
+
+	it("does not prefix-match without colon separator", () => {
+		registry.register({
+			discriminator: "tool",
+			component: StubAction,
+			displayName: "Tool",
+		});
+
+		// The discriminator "toolCall:foo" starts with "tool", but the next character is "C", not ":", so it must not prefix-match.
+		const matches = registry.resolveAll(makeTc("foo"));
+		expect(matches).toHaveLength(0);
+	});
+
+	// ── matches predicate ───────────────────────────────────────────────────
+
+	it("filters by matches predicate when provided", () => {
+		registry.register({
+			discriminator: "toolCall:search",
+			component: StubAction,
+			displayName: "Search with args",
+			matches: (tc) => tc.arguments?.query !== undefined,
+		});
+
+		expect(registry.resolveAll(makeTc("search"))).toHaveLength(0);
+		expect(
+			registry.resolveAll(makeTc("search", { arguments: { query: "hello" } })),
+		).toHaveLength(1);
+	});
+
+	// ── no match ────────────────────────────────────────────────────────────
+
+	it("returns empty for unknown discriminator", () => {
+		expect(registry.resolveAll(makeTc("unknown"))).toHaveLength(0);
+	});
+
+	// ── duplicate registration warning ──────────────────────────────────────
+
+	it("logs warning on duplicate registration in dev mode", () => {
+		const spy = vi.spyOn(console, "warn").mockImplementation(() => {});
+
+		registry.register({
+			discriminator: "dup",
+			component: StubAction,
+			displayName: "First",
+		});
+		registry.register({
+			discriminator: "dup",
+			component: StubAction,
+			displayName: "Second",
+		});
+
+		expect(spy).toHaveBeenCalledWith(
+			"[ToolCallExtensions] Replacing registration for discriminator: dup",
+		);
+
+		spy.mockRestore();
+	});
+
+	// ── hasMatch() ──────────────────────────────────────────────────────────
+
+	it("hasMatch() returns true for matching tool call", () => {
+		registry.register({
+			discriminator: "toolCall:search",
+			component: StubAction,
+			displayName: "Search",
+		});
+
+		expect(registry.hasMatch(makeTc("search"))).toBe(true);
+	});
+
+	it("hasMatch() returns true for prefix match", () => {
+		registry.register({
+			discriminator: "toolCall",
+			component: StubAction,
+			displayName: "Catch-all",
+		});
+
+		expect(registry.hasMatch(makeTc("retrieval"))).toBe(true);
+	});
+
+	it("hasMatch() returns false for unknown tool call", () => {
+		expect(registry.hasMatch(makeTc("nope"))).toBe(false);
+	});
+
+	// ── registrations() ─────────────────────────────────────────────────────
+
+	it("registrations() returns all registered items", () => {
+		const regA: ToolCallExtensionRegistration = {
+			discriminator: "toolCall:a",
+			component: StubAction,
+			displayName: "A",
+		};
+		const regB: ToolCallExtensionRegistration = {
+			discriminator: "toolCall:b",
+			component: StubAction,
+			displayName: "B",
+		};
+
+		registry.register(regA);
+		registry.register(regB);
+
+		const all = registry.registrations();
+		expect(all).toHaveLength(2);
+		expect(all).toContainEqual(regA);
+		expect(all).toContainEqual(regB);
+	});
+
+	// ── reset() ─────────────────────────────────────────────────────────────
+
+	it("reset() clears all registrations", () => {
+		registry.register({
+			discriminator: "toolCall:search",
+			component: StubAction,
+			displayName: "Search",
+		});
+
+		registry.reset();
+
+		expect(registry.registrations()).toHaveLength(0);
+		expect(registry.hasMatch(makeTc("search"))).toBe(false);
+	});
+
+	// ── toolCallDiscriminator ───────────────────────────────────────────────
+
+	it("toolCallDiscriminator builds correct string", () => {
+		expect(toolCallDiscriminator(makeTc("search"))).toBe("toolCall:search");
+		expect(toolCallDiscriminator(makeTc("retrieval"))).toBe(
+			"toolCall:retrieval",
+		);
+	});
+});
diff --git a/frontend/tests/unit/registry/RegistryRenderer.test.tsx b/frontend/tests/unit/registry/RegistryRenderer.test.tsx
new file mode 100644
index 0000000..1a01f2b
--- /dev/null
+++ b/frontend/tests/unit/registry/RegistryRenderer.test.tsx
@@ -0,0 +1,77 @@
+import { render, screen } from "@testing-library/react";
+import { beforeEach, describe, expect, it } from "vitest";
+import type { ToolCallRecord } from "../../../src/models/groundTruth";
+import { toolCallExtensions } from "../../../src/registry/FieldComponentRegistry";
+import { ToolCallExtensionRenderer } from "../../../src/registry/RegistryRenderer";
+import type { ToolCallActionProps } from "../../../src/registry/types";
+
+// ---------------------------------------------------------------------------
+// Helpers
+// ---------------------------------------------------------------------------
+
+function MockAction({ toolCall }: ToolCallActionProps) {
+	return <div data-testid="mock-action">Action for {toolCall.name}</div>;
+}
+
+function makeTc(name: string): ToolCallRecord {
+	return { id: "tc-1", name, callType: "tool" };
+}
+
+function renderExtension(toolCall: ToolCallRecord) {
+	return render(
+		<ToolCallExtensionRenderer
+			toolCall={toolCall}
+			context={{
+				item: {
+					id: "item-1",
+					question: "q",
+					answer: "",
+					status: "draft",
+					providerId: "json",
+					tags: [],
+				},
+				readOnly: false,
+			}}
+			references={[]}
+		/>,
+	);
+}
+
+// ---------------------------------------------------------------------------
+// Tests
+// ---------------------------------------------------------------------------
+
+describe("ToolCallExtensionRenderer", () => {
+	beforeEach(() => {
+		toolCallExtensions.reset();
+	});
+
+	it("renders nothing when no extensions match", () => {
+		const { container } = renderExtension(makeTc("unknown"));
+		expect(container.innerHTML).toBe("");
+	});
+
+	it("renders the registered component when a matching tool call is provided", () => {
+		toolCallExtensions.register({
+			discriminator: "toolCall:search",
+			component: MockAction,
+			displayName: "Search Action",
+		});
+
+		renderExtension(makeTc("search"));
+
+		expect(screen.getByTestId("mock-action")).toBeInTheDocument();
+		expect(screen.getByTestId("mock-action").textContent).toContain("search");
+	});
+
+	it("renders nothing when discriminator does not match", () => {
+		toolCallExtensions.register({
+			discriminator: "toolCall:search",
+			component: MockAction,
+			displayName: "Search Action",
+		});
+
+		const { container } = renderExtension(makeTc("other"));
+		expect(container.innerHTML).toBe("");
+	});
+});
diff --git a/frontend/tests/unit/services/groundTruths-mapping.test.ts b/frontend/tests/unit/services/groundTruths-mapping.test.ts
index a155917..389a38e 100644
--- a/frontend/tests/unit/services/groundTruths-mapping.test.ts
+++ b/frontend/tests/unit/services/groundTruths-mapping.test.ts
@@ -1,8 +1,26 @@
 import { describe, expect, it } from "vitest";
+import type { ApiGroundTruth } from "../../../src/adapters/apiMapper";
+import { groundTruthFromApi } from "../../../src/adapters/apiMapper";
 import type { components } from "../../../src/api/generated";
+import { getItemReferences } from "../../../src/models/groundTruth";
 import { mapGroundTruthFromApi } from "../../../src/services/groundTruths";
 
-type ApiItem = components["schemas"]["GroundTruthItem-Output"];
+type ApiItem = Omit<
+	components["schemas"]["AgenticGroundTruthEntry-Output"],
+	"history"
+> & {
+	synthQuestion?: string | null;
+	editedQuestion?: string | null;
+	answer?: string | null;
+	refs?: components["schemas"]["Reference"][];
+	totalReferences?: number;
+	tags?: string[];
+	comment?: string | null;
+	history?: (components["schemas"]["HistoryEntry"] & {
+		refs?: components["schemas"]["Reference"][];
+		expectedBehavior?: string[];
+	})[];
+};
 
 function makeApiItem(overrides: Partial<ApiItem> = {}): ApiItem {
 	return {
@@ -23,106 +41,75 @@ function makeApiItem(overrides: Partial<ApiItem> = {}): ApiItem {
 }
 
 describe("mapGroundTruthFromApi", () => {
-	describe("reference messageIndex assignment for single-turn conversion", () => {
-		it("assigns refs to messageIndex 1 when converting single-turn with answer", () => {
+	describe("core-generic mapping", () => {
+		it("converts assistant role to agent and keeps stable turn ids", () => {
 			const apiItem = makeApiItem({
-				synthQuestion: "Question",
-				answer: "Answer",
-				refs: [
-					{
-						url: "https://example.com",
-						content: "content",
-						keyExcerpt: "key",
-						bonus: false,
-					},
-				],
-				history: undefined,
-			});
-
-			const result = mapGroundTruthFromApi(apiItem);
-
-			expect(result.references).toHaveLength(1);
-			expect(result.references[0].messageIndex).toBe(1); // Agent turn index
-		});
-
-		it("assigns refs to messageIndex 1 even when no answer exists (Bug Fix: SA-86)", () => {
-			// Bug 3: For Questions Without an Answer, Agent Turn Doesn't Get Created and UI Doesn't Show Existing Refs
-			// Fix ensures that refs are assigned to agent turn (messageIndex = 1) even when answer is empty
-			const apiItem = makeApiItem({
-				synthQuestion: "Question",
-				answer: "",
-				refs: [
-					{
-						url: "https://example.com",
-						content: "content",
-						bonus: false,
-					},
+				history: [
+					{ role: "user", msg: "Question" },
+					{ role: "assistant", msg: "Answer" },
 				],
-				history: undefined,
-			});
-
-			const result = mapGroundTruthFromApi(apiItem);
-
-			// References should be assigned to agent turn (messageIndex = 1)
-			expect(result.references[0].messageIndex).toBe(1);
-		});
-
-		it("creates empty agent turn when question exists but answer is missing (Bug Fix: SA-86)", () => {
-			// Bug 3: Ensures agent turn is created even without an answer
-			const apiItem = makeApiItem({
-				synthQuestion: "Question without answer",
-				answer: "",
-				history: undefined,
 			});
-
 			const result = mapGroundTruthFromApi(apiItem);
-
-			// Should create both user and agent turns
-			expect(result.history).toHaveLength(2);
 			expect(result.history?.[0]).toMatchObject({
 				role: "user",
-				content: "Question without answer",
+				content: "Question",
 			});
 			expect(result.history?.[1]).toMatchObject({
 				role: "agent",
-				content: "", // Empty agent content
+				content: "Answer",
 			});
+			expect(result.history?.[0].turnId).toBeTruthy();
+			expect(result.history?.[1].turnId).toBeTruthy();
 		});
 
-		it("creates empty agent turn for null answer (Bug Fix: SA-86)", () => {
+		it("preserves per-turn refs when canonical history already exists", () => {
 			const apiItem = makeApiItem({
-				synthQuestion: "Question",
-				answer: null as unknown as string,
-				history: undefined,
+				history: [
+					{ role: "user", msg: "Q1" },
+					{
+						role: "assistant",
+						msg: "A1",
+						refs: [
+							{
+								url: "https://turn-ref.com",
+								content: "turn content",
+								bonus: false,
+							},
+						],
+					},
+				],
 			});
-
 			const result = mapGroundTruthFromApi(apiItem);
-
-			expect(result.history).toHaveLength(2);
-			expect(result.history?.[1]).toMatchObject({
-				role: "agent",
-				content: "",
+			const [ref] = getItemReferences(result);
+			expect(ref).toMatchObject({
+				url: "https://turn-ref.com",
+				messageIndex: 1,
+				turnId: result.history?.[1]?.turnId,
 			});
 		});
+	});
 
-		it("creates empty agent turn for undefined answer (Bug Fix: SA-86)", () => {
+	describe("compat-migration read mapping", () => {
+		it("creates synthesized user and agent turns from legacy single-turn fields", () => {
 			const apiItem = makeApiItem({
-				synthQuestion: "Question",
-				answer: undefined as unknown as string,
+				synthQuestion: "Synth",
+				editedQuestion: "Edited",
+				answer: "A",
 				history: undefined,
 			});
-
 			const result = mapGroundTruthFromApi(apiItem);
-
 			expect(result.history).toHaveLength(2);
+			expect(result.history?.[0]).toMatchObject({
+				role: "user",
+				content: "Edited",
+			});
 			expect(result.history?.[1]).toMatchObject({
 				role: "agent",
-				content: "",
+				content: "A",
 			});
 		});
 
-		it("assigns refs to messageIndex 1 for question with refs but no answer (Bug Fix: SA-86)", () => {
-			// Real-world scenario: curated question with research refs but answer not yet written
+		it("anchors legacy top-level refs to the synthesized agent turn when answer is empty", () => {
 			const apiItem = makeApiItem({
 				editedQuestion: "How do I configure authentication for my app?",
 				answer: "",
@@ -133,130 +120,154 @@ describe("mapGroundTruthFromApi", () => {
 						keyExcerpt: "Use OAuth 2.0 for authentication",
 						bonus: false,
 					},
-					{
-						url: "https://docs.example.com/config",
-						content: "Configuration guide",
-						bonus: false,
-					},
 				],
 				history: undefined,
 			});
-
 			const result = mapGroundTruthFromApi(apiItem);
-
-			// Should create history with empty agent turn
+			const [ref] = getItemReferences(result);
 			expect(result.history).toHaveLength(2);
-			expect(result.history?.[0].content).toBe(
-				"How do I configure authentication for my app?",
-			);
-			expect(result.history?.[1].content).toBe("");
-
-			// All refs should be assigned to the agent turn
-			expect(result.references).toHaveLength(2);
-			expect(result.references[0].messageIndex).toBe(1);
-			expect(result.references[1].messageIndex).toBe(1);
-		});
-
-		it("preserves per-turn refs when history exists", () => {
-			const apiItem = makeApiItem({
-				history: [
-					{
-						role: "user",
-						msg: "Q1",
-						refs: undefined,
-					},
-					{
-						role: "assistant",
-						msg: "A1",
-						refs: [
-							{
-								url: "https://turn-ref.com",
-								content: "turn content",
-								bonus: false,
-							},
-						],
-					},
-				],
-			});
-
-			const result = mapGroundTruthFromApi(apiItem);
-
-			// Per-turn refs should be extracted with proper messageIndex
-			// This is tested in the provider tests, so we just verify they exist
-			expect(result.references).toBeDefined();
-		});
-	});
-
-	describe("history mapping", () => {
-		it("converts assistant role to agent", () => {
-			const apiItem = makeApiItem({
-				history: [
-					{ role: "user", msg: "Question" },
-					{ role: "assistant", msg: "Answer" },
-				],
-			});
-
-			const result = mapGroundTruthFromApi(apiItem);
-
-			expect(result.history?.[1].role).toBe("agent");
-		});
-
-		it("preserves user role", () => {
-			const apiItem = makeApiItem({
-				history: [{ role: "user", msg: "Question" }],
-			});
-
-			const result = mapGroundTruthFromApi(apiItem);
-
-			expect(result.history?.[0].role).toBe("user");
-		});
-
-		it("creates history from synthQuestion when no history provided", () => {
-			const apiItem = makeApiItem({
-				synthQuestion: "Synth Q",
-				answer: "A",
-				history: undefined,
-			});
-
-			const result = mapGroundTruthFromApi(apiItem);
-
-			expect(result.history).toHaveLength(2);
-			expect(result.history?.[0].content).toBe("Synth Q");
-			expect(result.history?.[1].content).toBe("A");
-		});
-
-		it("prefers editedQuestion over synthQuestion", () => {
-			const apiItem = makeApiItem({
-				synthQuestion: "Synth",
-				editedQuestion: "Edited",
-				history: undefined,
+			expect(result.history?.[1]).toMatchObject({ role: "agent", content: "" });
+			expect(ref).toMatchObject({
+				url: "https://docs.example.com/auth",
+				messageIndex: 1,
+				turnId: result.history?.[1]?.turnId,
 			});
-
-			const result = mapGroundTruthFromApi(apiItem);
-
-			expect(result.history?.[0].content).toBe("Edited");
 		});
 	});
 
 	describe("providerId", () => {
 		it("defaults to 'api' when not provided", () => {
-			const apiItem = makeApiItem({
-				synthQuestion: "Q",
-			});
-
-			const result = mapGroundTruthFromApi(apiItem);
-
+			const result = mapGroundTruthFromApi(makeApiItem({ synthQuestion: "Q" }));
 			expect(result.providerId).toBe("api");
 		});
 
 		it("uses provided providerId", () => {
-			const apiItem = makeApiItem({
-				synthQuestion: "Q",
-			});
+			const result = mapGroundTruthFromApi(
+				makeApiItem({ synthQuestion: "Q" }),
+				"custom-provider",
+			);
+			expect(result.providerId).toBe("custom-provider");
+		});
+	});
+});
 
-			const result = mapGroundTruthFromApi(apiItem, "custom-provider");
+describe("mapper parity: groundTruthFromApi and mapGroundTruthFromApi", () => {
+	function normalizeTurnIdentity(item: ReturnType<typeof groundTruthFromApi>) {
+		const normalizedPlugins = item.plugins
+			? Object.fromEntries(
+					Object.entries(item.plugins).map(([slot, payload]) => [
+						slot,
+						slot === "rag-compat" && payload.data?.retrievals
+							? {
+									...payload,
+									data: {
+										...payload.data,
+										retrievals: Object.fromEntries(
+											Object.entries(
+												payload.data.retrievals as Record<
+													string,
+													{ candidates?: Array<Record<string, unknown>> }
+												>,
+											).map(([key, bucket]) => [
+												key,
+												{
+													...bucket,
+													candidates: (bucket.candidates ?? []).map(
+														(candidate) => ({
+															...candidate,
+															turnId: candidate.turnId
+																? "<normalized>"
+																: undefined,
+														}),
+													),
+												},
+											]),
+										),
+									},
+								}
+							: payload,
+					]),
+				)
+			: item.plugins;
+		return {
+			...item,
+			history: item.history?.map((turn) => ({
+				...turn,
+				turnId: "<normalized>",
+				stepId: "<normalized>",
+			})),
+			plugins: normalizedPlugins,
+		};
+	}
+
+	function makeSharedPayload(
+		overrides: Partial<ApiGroundTruth> = {},
+	): ApiGroundTruth {
+		return {
+			id: "parity-1",
+			status: "draft",
+			answer: "Parity answer",
+			synthQuestion: "Synth parity Q",
+			editedQuestion: "Edited parity Q",
+			history: undefined,
+			refs: [],
+			tags: ["t1"],
+			manualTags: ["m1"],
+			computedTags: ["c1"],
+			comment: "a comment",
+			datasetName: "ds",
+			bucket: "bkt" as ApiGroundTruth["bucket"],
+			_etag: "etag-parity",
+			reviewedAt: "2024-01-01T00:00:00Z",
+			...overrides,
+		} as ApiGroundTruth;
+	}
+
+	it("produces identical output for a legacy single-turn payload", () => {
+		const payload = makeSharedPayload();
+		const fromProvider = groundTruthFromApi(payload);
+		const fromService = mapGroundTruthFromApi(payload);
+		expect(normalizeTurnIdentity(fromProvider)).toEqual(
+			normalizeTurnIdentity(fromService),
+		);
+	});
 
-			expect(result.providerId).toBe("custom-provider");
+	it("produces identical output for a multi-turn payload with per-turn refs", () => {
+		const payload = makeSharedPayload({
+			editedQuestion: "",
+			synthQuestion: "",
+			answer: "",
+			history: [
+				{ role: "user", msg: "First question" },
+				{
+					role: "assistant",
+					msg: "First answer",
+					refs: [{ url: "https://ref1.com", content: "Ref 1", bonus: false }],
+				},
+				{ role: "user", msg: "Follow-up" },
+				{
+					role: "assistant",
+					msg: "Follow-up answer",
+					refs: [{ url: "https://ref2.com", content: "Ref 2", bonus: true }],
+				},
+			],
 		});
+		const fromProvider = groundTruthFromApi(payload);
+		const fromService = mapGroundTruthFromApi(payload);
+		expect(normalizeTurnIdentity(fromProvider)).toEqual(
+			normalizeTurnIdentity(fromService),
+		);
+		expect(getItemReferences(fromProvider)).toHaveLength(2);
+	});
+
+	it("preserves reviewedAt through both paths identically", () => {
+		const payload = makeSharedPayload({ reviewedAt: "2025-06-01T12:00:00Z" });
+		const fromProvider = groundTruthFromApi(payload);
+		const fromService = mapGroundTruthFromApi(payload);
+		expect(fromProvider.reviewedAt).toBe("2025-06-01T12:00:00Z");
+		expect(fromService.reviewedAt).toBe("2025-06-01T12:00:00Z");
+		expect(normalizeTurnIdentity(fromProvider)).toEqual(
+			normalizeTurnIdentity(fromService),
+		);
 	});
 });
diff --git a/frontend/tests/unit/services/http.test.ts b/frontend/tests/unit/services/http.test.ts
new file mode 100644
index 0000000..3e37a23
--- /dev/null
+++ b/frontend/tests/unit/services/http.test.ts
@@ -0,0 +1,35 @@
+import { afterEach, describe, expect, it, vi } from "vitest";
+
+describe("http base path helpers", () => {
+	afterEach(() => {
+		vi.resetModules();
+		vi.unstubAllEnvs();
+	});
+
+	it("keeps API paths at the root by default", async () => {
+		const http = await import("../../../src/services/http");
+
+		expect(http.getAppBasePath()).toBe("");
+		expect(http.getApiBaseUrl()).toBe("/v1");
+		expect(http.prefixAppBasePath("/v1/config")).toBe("/v1/config");
+	});
+
+	it("prefixes root-relative paths when BASE_URL is configured", async () => {
+		vi.stubEnv("BASE_URL", "/gtc/");
+		const http = await import("../../../src/services/http");
+
+		expect(http.getAppBasePath()).toBe("/gtc");
+		expect(http.getApiBaseUrl()).toBe("/gtc/v1");
+		expect(http.prefixAppBasePath("/v1/config")).toBe("/gtc/v1/config");
+	});
+
+	it("avoids double-prefixing paths that already include the base path", async () => {
+		vi.stubEnv("BASE_URL", "/gtc/");
+		const http = await import("../../../src/services/http");
+
+		expect(http.prefixAppBasePath("/gtc/v1/config")).toBe("/gtc/v1/config");
+		expect(http.prefixAppBasePath("https://example.com/v1/config")).toBe(
+			"https://example.com/v1/config",
+		);
+	});
+});
diff --git a/frontend/tests/unit/services/runtimeConfig.test.ts b/frontend/tests/unit/services/runtimeConfig.test.ts
new file mode 100644
index 0000000..b1d59ec
--- /dev/null
+++ b/frontend/tests/unit/services/runtimeConfig.test.ts
@@ -0,0 +1,32 @@
+import { afterEach, describe, expect, it, vi } from "vitest";
+
+const runtimeConfigFixture = {
+	requireReferenceVisit: true,
+	requireKeyParagraph: false,
+	selfServeLimit: 10,
+	trustedReferenceDomains: ["example.com"],
+};
+
+describe("runtime config base path support", () => {
+	afterEach(() => {
+		vi.resetModules();
+		vi.unstubAllEnvs();
+		vi.unstubAllGlobals();
+	});
+
+	it("fetches runtime config under the configured base path", async () => {
+		vi.stubEnv("BASE_URL", "/gtc/");
+		const fetchMock = vi.fn().mockResolvedValue({
+			ok: true,
+			json: async () => runtimeConfigFixture,
+		});
+		vi.stubGlobal("fetch", fetchMock);
+
+		const { getRuntimeConfig } = await import(
+			"../../../src/services/runtimeConfig"
+		);
+
+		await expect(getRuntimeConfig()).resolves.toEqual(runtimeConfigFixture);
+		expect(fetchMock).toHaveBeenCalledWith("/gtc/v1/config");
+	});
+});
diff --git a/frontend/vite.config.ts b/frontend/vite.config.ts
index 615162c..1b400bd 100644
--- a/frontend/vite.config.ts
+++ b/frontend/vite.config.ts
@@ -2,27 +2,50 @@ import tailwindcss from "@tailwindcss/vite";
 import react from "@vitejs/plugin-react";
 import { defineConfig } from "vite";
 
+const env = (() => {
+	const glb = globalThis as unknown as {
+		process?: { env?: Record<string, string | undefined> };
+	};
+	return glb.process?.env ?? {};
+})();
+
+const backendProxyTarget = env.HARNESS_BACKEND_URL ?? "http://localhost:8000";
+
+function normalizeViteBasePath(basePath: string | undefined): string {
+	if (!basePath) return "/";
+	const trimmed = basePath.trim();
+	if (!trimmed || trimmed === "/") return "/";
+	return `/${trimmed.replace(/^\/+|\/+$/g, "")}/`;
+}
+
+const appBasePath = normalizeViteBasePath(env.VITE_APP_BASE_PATH);
+const apiProxyPrefixes = Array.from(
+	new Set([
+		"/v1",
+		`${appBasePath.endsWith("/") ? appBasePath.slice(0, -1) : appBasePath}/v1`,
+	]),
+);
+const proxy = Object.fromEntries(
+	apiProxyPrefixes.map((prefix) => [
+		prefix,
+		{
+			target: backendProxyTarget,
+			changeOrigin: true,
+			secure: false,
+		},
+	]),
+);
+
 // https://vite.dev/config/
 export default defineConfig({
+	base: appBasePath,
 	plugins: [react(), tailwindcss()],
 	define: {
 		// Make DEMO_MODE available to the client non-prefixed
-		...(() => {
-			const glb = globalThis as unknown as {
-				process?: { env?: Record<string, string | undefined> };
-			};
-			const demo = glb.process?.env?.DEMO_MODE ?? "";
-			return { "import.meta.env.DEMO_MODE": JSON.stringify(demo) };
-		})(),
+		"import.meta.env.DEMO_MODE": JSON.stringify(env.DEMO_MODE ?? ""),
 	},
 	server: {
-		proxy: {
-			// Forward API calls to backend in dev to avoid CORS
-			"/v1": {
-				target: "http://localhost:8000",
-				changeOrigin: true,
-				secure: false,
-			},
-		},
+		// Forward API calls to backend in dev to avoid CORS
+		proxy,
 	},
 });
diff --git a/scripts/audit_harness.sh b/scripts/audit_harness.sh
new file mode 100755
index 0000000..0a19578
--- /dev/null
+++ b/scripts/audit_harness.sh
@@ -0,0 +1,94 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+usage() {
+  cat <<'EOF'
+Usage: scripts/audit_harness.sh [repo_path]
+
+Audit a repository for baseline harness engineering artifacts.
+EOF
+}
+
+target_path="${1:-.}"
+if [ "$target_path" = "-h" ] || [ "$target_path" = "--help" ]; then
+  usage
+  exit 0
+fi
+
+if [ ! -d "$target_path" ]; then
+  echo "error: target path does not exist: $target_path" >&2
+  exit 1
+fi
+
+target_path=$(cd "$target_path" && pwd)
+failures=0
+
+ok() {
+  echo "[ok]      $1"
+}
+
+fail() {
+  echo "[missing] $1"
+  failures=$((failures + 1))
+}
+
+check_file() {
+  local relative="$1"
+  if [ -f "$target_path/$relative" ]; then
+    ok "$relative"
+  else
+    fail "$relative"
+  fi
+}
+
+check_contains() {
+  local relative="$1"
+  local pattern="$2"
+  local label="$3"
+  local full="$target_path/$relative"
+
+  if [ ! -f "$full" ]; then
+    fail "$label (file missing: $relative)"
+    return
+  fi
+
+  if grep -Eq "$pattern" "$full"; then
+    ok "$label"
+  else
+    fail "$label"
+  fi
+}
+
+echo "Auditing harness artifacts in: $target_path"
+echo
+
+check_file "AGENTS.md"
+check_file "docs/ARCHITECTURE.md"
+check_file "docs/OBSERVABILITY.md"
+check_file "Makefile.harness"
+check_file "scripts/audit_harness.sh"
+check_file "scripts/harness/smoke.sh"
+check_file "scripts/harness/test.sh"
+check_file "scripts/harness/lint.sh"
+check_file "scripts/harness/typecheck.sh"
+check_file ".github/workflows/harness.yml"
+
+echo
+check_contains "AGENTS.md" "Harness Commands" "AGENTS.md: Harness Commands section"
+check_contains "AGENTS.md" "Execution Plans" "AGENTS.md: Execution Plans section"
+check_contains "docs/ARCHITECTURE.md" "Boundaries" "ARCHITECTURE.md: boundary guidance"
+check_contains "docs/OBSERVABILITY.md" "Required Event Fields" "OBSERVABILITY.md: required fields"
+check_contains "Makefile.harness" "^smoke:" "Makefile.harness: smoke target"
+check_contains "Makefile.harness" "^test:" "Makefile.harness: test target"
+check_contains "Makefile.harness" "^lint:" "Makefile.harness: lint target"
+check_contains "Makefile.harness" "^typecheck:" "Makefile.harness: typecheck target"
+check_contains "Makefile.harness" "^ci:" "Makefile.harness: ci target"
+check_contains ".github/workflows/harness.yml" "make( -f Makefile\.harness)? ci" "CI workflow executes harness ci"
+
+echo
+if [ "$failures" -gt 0 ]; then
+  echo "Harness audit failed: $failures issue(s) detected."
+  exit 1
+fi
+
+echo "Harness audit passed."
diff --git a/scripts/harness/api_check.sh b/scripts/harness/api_check.sh
new file mode 100755
index 0000000..364db44
--- /dev/null
+++ b/scripts/harness/api_check.sh
@@ -0,0 +1,31 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+root_dir=$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)
+cd "$root_dir"
+
+if [ -n "${HARNESS_API_CHECK_CMD:-}" ]; then
+  eval "$HARNESS_API_CHECK_CMD"
+  exit 0
+fi
+
+command -v uv >/dev/null 2>&1 || { echo 'ERROR: uv is required for API checks.' >&2; exit 1; }
+command -v npm >/dev/null 2>&1 || { echo 'ERROR: npm is required for API checks.' >&2; exit 1; }
+
+echo '==> Exporting OpenAPI spec'
+(
+  cd "$root_dir/backend"
+  uv run python scripts/export_openapi.py
+)
+
+echo '==> Checking committed OpenAPI spec'
+git -C "$root_dir" --no-pager diff --exit-code -- frontend/src/api/openapi.json || {
+  echo 'ERROR: OpenAPI spec is out of date. Run: cd backend && uv run python scripts/export_openapi.py' >&2
+  exit 1
+}
+
+echo '==> Checking generated frontend API types'
+(
+  cd "$root_dir/frontend"
+  npm run api:types:check
+)
diff --git a/scripts/harness/backend_integration_test.sh b/scripts/harness/backend_integration_test.sh
new file mode 100755
index 0000000..688b9f0
--- /dev/null
+++ b/scripts/harness/backend_integration_test.sh
@@ -0,0 +1,34 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+root_dir=$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)
+cd "$root_dir"
+
+if [ -n "${HARNESS_BACKEND_INTEGRATION_TEST_CMD:-}" ]; then
+  eval "$HARNESS_BACKEND_INTEGRATION_TEST_CMD"
+  exit 0
+fi
+
+command -v uv >/dev/null 2>&1 || { echo 'ERROR: uv is required for backend integration tests.' >&2; exit 1; }
+command -v curl >/dev/null 2>&1 || { echo 'ERROR: curl is required for backend integration tests.' >&2; exit 1; }
+
+cosmos_ready_url="${HARNESS_COSMOS_READY_URL:-${GTC_COSMOS_ENDPOINT:-http://localhost:8081}}"
+
+echo "==> Waiting for Cosmos emulator at ${cosmos_ready_url}"
+for _ in $(seq 1 60); do
+  if curl -sS --max-time 2 "$cosmos_ready_url" >/dev/null 2>&1; then
+    break
+  fi
+  sleep 2
+done
+
+if ! curl -sS --max-time 2 "$cosmos_ready_url" >/dev/null 2>&1; then
+  echo "ERROR: Cosmos emulator did not become ready at ${cosmos_ready_url}." >&2
+  exit 1
+fi
+
+echo '==> Backend integration tests'
+(
+  cd "$root_dir/backend"
+  GTC_LLM_ENABLED=False uv run pytest -q tests/integration -v --junitxml=pytest-int-results.xml
+)
diff --git a/scripts/harness/deploy_backend.sh b/scripts/harness/deploy_backend.sh
new file mode 100755
index 0000000..95d4542
--- /dev/null
+++ b/scripts/harness/deploy_backend.sh
@@ -0,0 +1,316 @@
+#!/usr/bin/env bash
+
+set -euo pipefail
+
+err() {
+  printf "ERROR: %s\n" "$1" >&2
+  exit 1
+}
+
+missing_vars=()
+
+require_cmd() {
+  local cmd="$1"
+  command -v "$cmd" >/dev/null 2>&1 || err "'$cmd' command is required"
+}
+
+require_var() {
+  local name="$1"
+  [[ -n "${!name:-}" ]] || missing_vars+=("$name")
+}
+
+python_cmd() {
+  if command -v python3 >/dev/null 2>&1; then
+    echo python3
+    return
+  fi
+  if command -v python >/dev/null 2>&1; then
+    echo python
+    return
+  fi
+  err "python3 or python is required"
+}
+
+require_cmd az
+
+PYTHON_BIN="$(python_cmd)"
+
+require_var AZURE_SUBSCRIPTION_ID
+require_var AZURE_TENANT_ID
+require_var RESOURCE_GROUP_NAME
+require_var GROUND_TRUTH_CURATION_NAME
+require_var USER_ASSIGNED_IDENTITY
+require_var ENVIRONMENT_NAME
+require_var AUTH_CLIENT_ID
+require_var GT_COSMOS_DB_ACCOUNT
+require_var GTC_COSMOS_DB_NAME
+
+if [[ -z "${CONTAINER_IMAGE:-}" ]]; then
+  require_var REGISTRY_PREFIX
+  require_var TAG_NAME
+fi
+
+if (( ${#missing_vars[@]} > 0 )); then
+  err "Missing required environment variables: ${missing_vars[*]}"
+fi
+
+GT_COSMOS_DB_RESOURCE_GROUP="${GT_COSMOS_DB_RESOURCE_GROUP:-$RESOURCE_GROUP_NAME}"
+GT_COSMOS_DB_INDEXING_POLICY="${GT_COSMOS_DB_INDEXING_POLICY:-backend/scripts/indexing-policy.json}"
+WORKLOAD_PROFILE_NAME="${WORKLOAD_PROFILE_NAME:-ai-apps}"
+CONTAINER_CPU="${CONTAINER_CPU:-0.5}"
+CONTAINER_MEMORY="${CONTAINER_MEMORY:-1.0Gi}"
+CONTAINER_IMAGE="${CONTAINER_IMAGE:-${REGISTRY_PREFIX}.azurecr.io/gtc-backend:${TAG_NAME}}"
+REGISTRY_SERVER="${REGISTRY_SERVER:-}"
+if [[ -z "$REGISTRY_SERVER" && "$CONTAINER_IMAGE" == */* ]]; then
+  REGISTRY_SERVER="${CONTAINER_IMAGE%%/*}"
+fi
+[[ -n "$REGISTRY_SERVER" ]] || err "Set REGISTRY_SERVER or provide CONTAINER_IMAGE with a registry hostname"
+GTC_EZAUTH_ALLOW_ANONYMOUS_PATHS="${GTC_EZAUTH_ALLOW_ANONYMOUS_PATHS:-/healthz,/metrics}"
+GTC_COSMOS_CONTAINER_GT="${GTC_COSMOS_CONTAINER_GT:-ground_truth}"
+GTC_COSMOS_CONTAINER_ASSIGNMENTS="${GTC_COSMOS_CONTAINER_ASSIGNMENTS:-assignments}"
+GTC_COSMOS_CONTAINER_TAGS="${GTC_COSMOS_CONTAINER_TAGS:-tags}"
+GT_COSMOS_CONTAINER_GT_MAX_THROUGHPUT="${GT_COSMOS_CONTAINER_GT_MAX_THROUGHPUT:-1000}"
+GT_COSMOS_CONTAINER_ASSIGNMENTS_MAX_THROUGHPUT="${GT_COSMOS_CONTAINER_ASSIGNMENTS_MAX_THROUGHPUT:-1000}"
+GT_COSMOS_CONTAINER_TAGS_MAX_THROUGHPUT="${GT_COSMOS_CONTAINER_TAGS_MAX_THROUGHPUT:-1000}"
+GTC_COSMOS_CONTAINER_TAG_DEFINITIONS="${GTC_COSMOS_CONTAINER_TAG_DEFINITIONS:-}"
+GT_COSMOS_CONTAINER_TAG_DEFINITIONS_MAX_THROUGHPUT="${GT_COSMOS_CONTAINER_TAG_DEFINITIONS_MAX_THROUGHPUT:-$GT_COSMOS_CONTAINER_TAGS_MAX_THROUGHPUT}"
+
+[[ -f "$GT_COSMOS_DB_INDEXING_POLICY" ]] || err "Indexing policy file not found: $GT_COSMOS_DB_INDEXING_POLICY"
+
+az account show --output none >/dev/null 2>&1 || err "Azure CLI is not logged in"
+az account set --subscription "$AZURE_SUBSCRIPTION_ID"
+az extension add --name containerapp --upgrade
+
+printf 'Deploying image: %s\n' "$CONTAINER_IMAGE"
+printf 'Container app: %s\n' "$GROUND_TRUTH_CURATION_NAME"
+printf 'Resource group: %s\n' "$RESOURCE_GROUP_NAME"
+printf 'Cosmos account: %s\n' "$GT_COSMOS_DB_ACCOUNT"
+
+managed_identity_resource="/subscriptions/${AZURE_SUBSCRIPTION_ID}/resourcegroups/${RESOURCE_GROUP_NAME}/providers/Microsoft.ManagedIdentity/userAssignedIdentities/${USER_ASSIGNED_IDENTITY}"
+
+tmp_dir="$(mktemp -d)"
+trap 'rm -rf "$tmp_dir"' EXIT
+env_yaml_file="$tmp_dir/env.yaml"
+config_file="$tmp_dir/container-config.yaml"
+
+"$PYTHON_BIN" - <<'PY' >"$env_yaml_file"
+import json
+import os
+
+keys = []
+for key in os.environ:
+    if (
+        key.startswith("GTC_")
+        or key.startswith("APPLICATIONINSIGHTS_")
+        or key == "AZURE_CLIENT_ID"
+        or key.startswith("HEALTHCHECK_")
+    ):
+        keys.append(key)
+
+for key in sorted(keys):
+    print(f"                    - name: {key}")
+    print(f"                      value: {json.dumps(os.environ[key])}")
+PY
+
+env_line="env: []"
+env_block=""
+if [[ -s "$env_yaml_file" ]]; then
+  env_line="env:"
+  env_block="$(cat "$env_yaml_file")"
+fi
+
+pip_install=("$PYTHON_BIN" -m pip install --disable-pip-version-check)
+if "$PYTHON_BIN" -m pip help install 2>/dev/null | grep -q -- "--break-system-packages"; then
+  pip_install+=(--break-system-packages)
+fi
+
+{
+cat <<EOF
+identity:
+  type: UserAssigned
+  userAssignedIdentities:
+    "${managed_identity_resource}": {}
+properties:
+  workloadProfileName: ${WORKLOAD_PROFILE_NAME}
+  configuration:
+    ingress:
+      external: true
+      targetPort: 8080
+  template:
+    replicas:
+      min: 1
+      max: 1
+    scale:
+      rules: []
+    containers:
+      - name: ${GROUND_TRUTH_CURATION_NAME}
+        image: ${CONTAINER_IMAGE}
+        resources:
+          cpu: ${CONTAINER_CPU}
+          memory: ${CONTAINER_MEMORY}
+        ${env_line}
+EOF
+if [[ -n "$env_block" ]]; then
+  printf '%s\n' "$env_block"
+fi
+cat <<EOF
+        probes:
+          - type: Liveness
+            httpGet:
+              path: "/healthz"
+              port: 8080
+            initialDelaySeconds: 5
+            periodSeconds: 60
+            timeoutSeconds: 30
+            successThreshold: 1
+            failureThreshold: 3
+          - type: Readiness
+            httpGet:
+              path: "/healthz"
+              port: 8080
+            initialDelaySeconds: 0
+            periodSeconds: 10
+            timeoutSeconds: 1
+            successThreshold: 1
+            failureThreshold: 3
+EOF
+} >"$config_file"
+
+if az containerapp show --name "$GROUND_TRUTH_CURATION_NAME" --resource-group "$RESOURCE_GROUP_NAME" >/dev/null 2>&1; then
+  az containerapp update \
+    --name "$GROUND_TRUTH_CURATION_NAME" \
+    --resource-group "$RESOURCE_GROUP_NAME" \
+    --yaml "$config_file"
+else
+  az containerapp create \
+    --name "$GROUND_TRUTH_CURATION_NAME" \
+    --registry-identity system-environment \
+    --registry-server "$REGISTRY_SERVER" \
+    --resource-group "$RESOURCE_GROUP_NAME" \
+    --environment "$ENVIRONMENT_NAME" \
+    --image "$CONTAINER_IMAGE" \
+    --workload-profile-name "$WORKLOAD_PROFILE_NAME" \
+    --min-replicas 1 \
+    --max-replicas 1 \
+    --user-assigned "$managed_identity_resource"
+
+  az containerapp update \
+    --name "$GROUND_TRUTH_CURATION_NAME" \
+    --resource-group "$RESOURCE_GROUP_NAME" \
+    --yaml "$config_file"
+fi
+
+az containerapp update \
+  --name "$GROUND_TRUTH_CURATION_NAME" \
+  --resource-group "$RESOURCE_GROUP_NAME" \
+  --min-replicas 1 \
+  --max-replicas 1
+
+while IFS= read -r rule; do
+  [[ -n "$rule" ]] || continue
+  az containerapp update \
+    --name "$GROUND_TRUTH_CURATION_NAME" \
+    --resource-group "$RESOURCE_GROUP_NAME" \
+    --remove-scale-rule "$rule"
+done < <(
+  az containerapp show \
+    --name "$GROUND_TRUTH_CURATION_NAME" \
+    --resource-group "$RESOURCE_GROUP_NAME" \
+    --query "properties.template.scale.rules[].name" \
+    --output tsv
+)
+
+az containerapp revision set-mode \
+  --name "$GROUND_TRUTH_CURATION_NAME" \
+  --resource-group "$RESOURCE_GROUP_NAME" \
+  --mode single
+
+az containerapp auth update \
+  --name "$GROUND_TRUTH_CURATION_NAME" \
+  --resource-group "$RESOURCE_GROUP_NAME" \
+  --enabled true \
+  --unauthenticated-client-action RedirectToLoginPage \
+  --excluded-paths "$GTC_EZAUTH_ALLOW_ANONYMOUS_PATHS" \
+  --yes
+
+az containerapp auth microsoft update \
+  --name "$GROUND_TRUTH_CURATION_NAME" \
+  --resource-group "$RESOURCE_GROUP_NAME" \
+  --client-id "$AUTH_CLIENT_ID" \
+  --tenant-id "$AZURE_TENANT_ID"
+
+if ! az cosmosdb sql database show \
+  --account-name "$GT_COSMOS_DB_ACCOUNT" \
+  --name "$GTC_COSMOS_DB_NAME" \
+  --resource-group "$GT_COSMOS_DB_RESOURCE_GROUP" >/dev/null 2>&1; then
+  az cosmosdb sql database create \
+    --account-name "$GT_COSMOS_DB_ACCOUNT" \
+    --name "$GTC_COSMOS_DB_NAME" \
+    --resource-group "$GT_COSMOS_DB_RESOURCE_GROUP"
+fi
+
+if az cosmosdb sql container show \
+  --account-name "$GT_COSMOS_DB_ACCOUNT" \
+  --database-name "$GTC_COSMOS_DB_NAME" \
+  --name "$GTC_COSMOS_CONTAINER_GT" \
+  --resource-group "$GT_COSMOS_DB_RESOURCE_GROUP" >/dev/null 2>&1; then
+  az cosmosdb sql container update \
+    --account-name "$GT_COSMOS_DB_ACCOUNT" \
+    --database-name "$GTC_COSMOS_DB_NAME" \
+    --name "$GTC_COSMOS_CONTAINER_GT" \
+    --resource-group "$GT_COSMOS_DB_RESOURCE_GROUP" \
+    --idx "@${GT_COSMOS_DB_INDEXING_POLICY}"
+else
+  "${pip_install[@]}" azure-identity azure-cosmos
+  "$PYTHON_BIN" backend/scripts/cosmos_container_manager.py \
+    --endpoint "https://${GT_COSMOS_DB_ACCOUNT}.documents.azure.com:443/" \
+    --use-aad \
+    --db "$GTC_COSMOS_DB_NAME" \
+    --gt-container "$GTC_COSMOS_CONTAINER_GT" \
+    --indexing-policy "$GT_COSMOS_DB_INDEXING_POLICY" \
+    --max-throughput "$GT_COSMOS_CONTAINER_GT_MAX_THROUGHPUT"
+fi
+
+az cosmosdb sql container throughput update \
+  --account-name "$GT_COSMOS_DB_ACCOUNT" \
+  --database-name "$GTC_COSMOS_DB_NAME" \
+  --name "$GTC_COSMOS_CONTAINER_GT" \
+  --resource-group "$GT_COSMOS_DB_RESOURCE_GROUP" \
+  --max-throughput "$GT_COSMOS_CONTAINER_GT_MAX_THROUGHPUT"
+
+ensure_simple_container() {
+  local container_name="$1"
+  local partition_key="$2"
+  local max_throughput="$3"
+
+  if ! az cosmosdb sql container show \
+    --account-name "$GT_COSMOS_DB_ACCOUNT" \
+    --database-name "$GTC_COSMOS_DB_NAME" \
+    --name "$container_name" \
+    --resource-group "$GT_COSMOS_DB_RESOURCE_GROUP" >/dev/null 2>&1; then
+    az cosmosdb sql container create \
+      --account-name "$GT_COSMOS_DB_ACCOUNT" \
+      --database-name "$GTC_COSMOS_DB_NAME" \
+      --name "$container_name" \
+      --resource-group "$GT_COSMOS_DB_RESOURCE_GROUP" \
+      --partition-key-path "$partition_key" \
+      --max-throughput "$max_throughput"
+  fi
+
+  az cosmosdb sql container throughput update \
+    --account-name "$GT_COSMOS_DB_ACCOUNT" \
+    --database-name "$GTC_COSMOS_DB_NAME" \
+    --name "$container_name" \
+    --resource-group "$GT_COSMOS_DB_RESOURCE_GROUP" \
+    --max-throughput "$max_throughput"
+}
+
+ensure_simple_container "$GTC_COSMOS_CONTAINER_ASSIGNMENTS" "/pk" "$GT_COSMOS_CONTAINER_ASSIGNMENTS_MAX_THROUGHPUT"
+ensure_simple_container "$GTC_COSMOS_CONTAINER_TAGS" "/pk" "$GT_COSMOS_CONTAINER_TAGS_MAX_THROUGHPUT"
+
+if [[ -n "$GTC_COSMOS_CONTAINER_TAG_DEFINITIONS" ]]; then
+  ensure_simple_container \
+    "$GTC_COSMOS_CONTAINER_TAG_DEFINITIONS" \
+    "/tag_key" \
+    "$GT_COSMOS_CONTAINER_TAG_DEFINITIONS_MAX_THROUGHPUT"
+fi
diff --git a/scripts/harness/dev_down.sh b/scripts/harness/dev_down.sh
new file mode 100755
index 0000000..37607d6
--- /dev/null
+++ b/scripts/harness/dev_down.sh
@@ -0,0 +1,58 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+root_dir=$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)
+cd "$root_dir"
+
+backend_pid_file="$root_dir/.harness/dev/backend.pid"
+frontend_pid_file="$root_dir/.harness/dev/frontend.pid"
+overall_status=0
+
+read_pid() {
+  local pid_file="$1"
+  if [ -f "$pid_file" ]; then
+    tr -d '[:space:]' <"$pid_file"
+  fi
+}
+
+pid_is_running() {
+  local pid="$1"
+  [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null
+}
+
+stop_process() {
+  local name="$1"
+  local pid_file="$2"
+  local pid
+
+  pid=$(read_pid "$pid_file")
+  if [ -z "$pid" ]; then
+    echo "==> ${name} is not running"
+    return
+  fi
+
+  if ! pid_is_running "$pid"; then
+    echo "==> Removing stale ${name} PID file (${pid})"
+    rm -f "$pid_file"
+    return
+  fi
+
+  echo "==> Stopping ${name} (PID ${pid})"
+  kill "$pid"
+
+  for _ in $(seq 1 10); do
+    if ! pid_is_running "$pid"; then
+      rm -f "$pid_file"
+      return
+    fi
+    sleep 1
+  done
+
+  echo "ERROR: ${name} (PID ${pid}) did not stop cleanly." >&2
+  overall_status=1
+}
+
+stop_process backend "$backend_pid_file"
+stop_process frontend "$frontend_pid_file"
+
+exit "$overall_status"
diff --git a/scripts/harness/dev_up.sh b/scripts/harness/dev_up.sh
new file mode 100755
index 0000000..294a14e
--- /dev/null
+++ b/scripts/harness/dev_up.sh
@@ -0,0 +1,148 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+root_dir=$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)
+cd "$root_dir"
+
+backend_pid_file="$root_dir/.harness/dev/backend.pid"
+frontend_pid_file="$root_dir/.harness/dev/frontend.pid"
+backend_log_file="$root_dir/.harness/dev/backend.log"
+frontend_log_file="$root_dir/.harness/dev/frontend.log"
+backend_port="${HARNESS_BACKEND_PORT:-8000}"
+frontend_port="${HARNESS_FRONTEND_PORT:-5173}"
+
+command -v uv >/dev/null 2>&1 || { echo 'ERROR: uv is required for backend startup.' >&2; exit 1; }
+command -v npm >/dev/null 2>&1 || { echo 'ERROR: npm is required for frontend startup.' >&2; exit 1; }
+
+mkdir -p "$root_dir/.harness/dev"
+
+read_pid() {
+  local pid_file="$1"
+  if [ -f "$pid_file" ]; then
+    tr -d '[:space:]' <"$pid_file"
+  fi
+}
+
+pid_is_running() {
+  local pid="$1"
+  [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null
+}
+
+port_listener_info() {
+  local port="$1"
+
+  if ! command -v lsof >/dev/null 2>&1; then
+    return 1
+  fi
+
+  lsof -nP -iTCP:"$port" -sTCP:LISTEN 2>/dev/null | tail -n +2
+}
+
+resolve_available_port() {
+  local name="$1"
+  local requested_port="$2"
+  local port="$requested_port"
+  local listener_info
+  local attempts=0
+
+  while [ "$attempts" -lt 100 ]; do
+    listener_info=$(port_listener_info "$port" || true)
+    if [ -z "$listener_info" ]; then
+      if [ "$port" -ne "$requested_port" ]; then
+        echo "==> ${name} port ${requested_port} is busy; using ${port} instead" >&2
+      fi
+      printf '%s\n' "$port"
+      return 0
+    fi
+
+    if [ "$attempts" -eq 0 ]; then
+      echo "==> ${name} port ${requested_port} is busy; searching for the next available port" >&2
+      echo "$listener_info" >&2
+    fi
+
+    port=$((port + 1))
+    attempts=$((attempts + 1))
+  done
+
+  echo "ERROR: unable to find an available ${name} port starting at ${requested_port}." >&2
+  return 1
+}
+
+ensure_not_running() {
+  local name="$1"
+  local pid_file="$2"
+  local pid
+
+  pid=$(read_pid "$pid_file")
+  if pid_is_running "$pid"; then
+    echo "ERROR: ${name} is already running with PID ${pid}. Use 'make -f Makefile.harness dev-down' first." >&2
+    exit 1
+  fi
+
+  rm -f "$pid_file"
+}
+
+start_process() {
+  local name="$1"
+  local pid_file="$2"
+  local log_file="$3"
+  shift 3
+
+  : >"$log_file"
+
+  (
+    cd "$root_dir/$name"
+    nohup "$@" >"$log_file" 2>&1 &
+    echo $! >"$pid_file"
+  )
+}
+
+cleanup_started_processes() {
+  local pid
+  for pid in "$(read_pid "$backend_pid_file")" "$(read_pid "$frontend_pid_file")"; do
+    if pid_is_running "$pid"; then
+      kill "$pid" 2>/dev/null || true
+    fi
+  done
+
+  rm -f "$backend_pid_file" "$frontend_pid_file"
+}
+
+ensure_not_running "backend" "$backend_pid_file"
+ensure_not_running "frontend" "$frontend_pid_file"
+backend_port=$(resolve_available_port "backend" "$backend_port")
+frontend_port=$(resolve_available_port "frontend" "$frontend_port")
+
+echo '==> Starting backend in background'
+start_process backend "$backend_pid_file" "$backend_log_file" uv run uvicorn app.main:app --reload --host 127.0.0.1 --port "$backend_port"
+
+echo '==> Starting frontend in background'
+start_process frontend "$frontend_pid_file" "$frontend_log_file" env HARNESS_BACKEND_URL="http://127.0.0.1:${backend_port}" npm run dev -- --port "$frontend_port"
+
+sleep 2
+
+backend_pid=$(read_pid "$backend_pid_file")
+frontend_pid=$(read_pid "$frontend_pid_file")
+startup_failed=0
+
+if ! pid_is_running "$backend_pid"; then
+  echo "ERROR: backend exited during startup. Check $backend_log_file" >&2
+  startup_failed=1
+fi
+
+if ! pid_is_running "$frontend_pid"; then
+  echo "ERROR: frontend exited during startup. Check $frontend_log_file" >&2
+  startup_failed=1
+fi
+
+if [ "$startup_failed" -ne 0 ]; then
+  cleanup_started_processes
+  exit 1
+fi
+
+echo "Backend PID: $backend_pid"
+echo "Frontend PID: $frontend_pid"
+echo "Backend URL: http://127.0.0.1:${backend_port}"
+echo "Frontend URL: http://127.0.0.1:${frontend_port}"
+echo "Logs: $root_dir/.harness/dev/"
+echo "Stop both with: make -f Makefile.harness dev-down"
diff --git a/scripts/harness/format.sh b/scripts/harness/format.sh
new file mode 100755
index 0000000..85f0e3f
--- /dev/null
+++ b/scripts/harness/format.sh
@@ -0,0 +1,26 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+root_dir=$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)
+cd "$root_dir"
+
+if [ -n "${HARNESS_FORMAT_CMD:-}" ]; then
+  eval "$HARNESS_FORMAT_CMD"
+  exit 0
+fi
+
+command -v uv >/dev/null 2>&1 || { echo 'ERROR: uv is required for backend formatting.' >&2; exit 1; }
+command -v npm >/dev/null 2>&1 || { echo 'ERROR: npm is required for frontend formatting.' >&2; exit 1; }
+
+echo '==> Formatting backend'
+(
+  cd "$root_dir/backend"
+  uv run ruff format app tests scripts
+  uv run ruff check --fix app tests scripts
+)
+
+echo '==> Formatting frontend'
+(
+  cd "$root_dir/frontend"
+  npm run lint
+)
diff --git a/scripts/harness/lint.sh b/scripts/harness/lint.sh
new file mode 100755
index 0000000..32a4f2d
--- /dev/null
+++ b/scripts/harness/lint.sh
@@ -0,0 +1,25 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+root_dir=$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)
+cd "$root_dir"
+
+if [ -n "${HARNESS_LINT_CMD:-}" ]; then
+  eval "$HARNESS_LINT_CMD"
+  exit 0
+fi
+
+command -v uv >/dev/null 2>&1 || { echo 'ERROR: uv is required for backend linting.' >&2; exit 1; }
+command -v npm >/dev/null 2>&1 || { echo 'ERROR: npm is required for frontend linting.' >&2; exit 1; }
+
+echo '==> Backend lint'
+(
+  cd "$root_dir/backend"
+  uv run ruff check app/
+)
+
+echo '==> Frontend lint'
+(
+  cd "$root_dir/frontend"
+  npm run lint:check
+)
diff --git a/scripts/harness/setup.sh b/scripts/harness/setup.sh
new file mode 100755
index 0000000..750b52e
--- /dev/null
+++ b/scripts/harness/setup.sh
@@ -0,0 +1,31 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+root_dir=$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)
+cd "$root_dir"
+
+if [ -n "${HARNESS_SETUP_CMD:-}" ]; then
+  eval "$HARNESS_SETUP_CMD"
+  exit 0
+fi
+
+command -v uv >/dev/null 2>&1 || { echo 'ERROR: uv is required for backend setup.' >&2; exit 1; }
+command -v npm >/dev/null 2>&1 || { echo 'ERROR: npm is required for frontend setup.' >&2; exit 1; }
+
+mkdir -p "$root_dir/.harness"
+
+echo '==> Syncing backend dependencies'
+(
+  cd "$root_dir/backend"
+  uv sync --frozen
+)
+
+echo '==> Installing frontend dependencies'
+(
+  cd "$root_dir/frontend"
+  if [ -f package-lock.json ]; then
+    npm ci --no-audit --no-fund
+  else
+    npm install
+  fi
+)
diff --git a/scripts/harness/smoke.sh b/scripts/harness/smoke.sh
new file mode 100755
index 0000000..0593c5b
--- /dev/null
+++ b/scripts/harness/smoke.sh
@@ -0,0 +1,73 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+root_dir=$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)
+cd "$root_dir"
+
+PORT="${HARNESS_SMOKE_PORT:-8008}"
+HEALTH_URL="${HARNESS_SMOKE_HEALTH_URL:-http://127.0.0.1:${PORT}/healthz}"
+OPENAPI_URL="${HARNESS_SMOKE_URL:-http://127.0.0.1:${PORT}/v1/openapi.json}"
+SERVER_LOG="$root_dir/.harness/smoke-server.log"
+
+command -v uv >/dev/null 2>&1 || { echo 'ERROR: uv is required for backend smoke tests.' >&2; exit 1; }
+command -v npm >/dev/null 2>&1 || { echo 'ERROR: npm is required for frontend smoke tests.' >&2; exit 1; }
+command -v curl >/dev/null 2>&1 || { echo 'ERROR: curl is required for smoke tests.' >&2; exit 1; }
+
+mkdir -p "$root_dir/.harness"
+: > "$root_dir/.harness/logs.jsonl"
+: > "$root_dir/.harness/traces.jsonl"
+: > "$SERVER_LOG"
+
+SERVER_PID=''
+cleanup() {
+  if [ -n "$SERVER_PID" ] && kill -0 "$SERVER_PID" 2>/dev/null; then
+    kill "$SERVER_PID" 2>/dev/null || true
+    wait "$SERVER_PID" 2>/dev/null || true
+  fi
+}
+trap cleanup EXIT
+
+echo '==> Starting backend smoke server'
+(
+  cd "$root_dir/backend"
+  exec env \
+    GTC_AZ_MONITOR_ENABLED=false \
+    GTC_HARNESS_JSONL_ENABLED=true \
+    uv run uvicorn app.main:app --host 127.0.0.1 --port "$PORT"
+) >"$SERVER_LOG" 2>&1 &
+SERVER_PID=$!
+
+echo "==> Waiting for $HEALTH_URL"
+for _ in $(seq 1 20); do
+  if curl -fsS "$HEALTH_URL" >/dev/null; then
+    break
+  fi
+  sleep 1
+done
+
+if ! curl -fsS "$HEALTH_URL" >/dev/null; then
+  echo 'ERROR: backend health probe failed.' >&2
+  tail -50 "$SERVER_LOG" >&2 || true
+  exit 1
+fi
+
+echo "==> Probing $OPENAPI_URL"
+curl -fsS "$OPENAPI_URL" >/dev/null
+
+echo '==> Building frontend bundle'
+(
+  cd "$root_dir/frontend"
+  npm run build
+)
+
+if [ ! -s "$root_dir/.harness/logs.jsonl" ]; then
+  echo 'ERROR: smoke run did not emit .harness/logs.jsonl entries.' >&2
+  exit 1
+fi
+
+if [ ! -s "$root_dir/.harness/traces.jsonl" ]; then
+  echo 'ERROR: smoke run did not emit .harness/traces.jsonl entries.' >&2
+  exit 1
+fi
+
+echo 'Smoke test passed ✅'
diff --git a/scripts/harness/test.sh b/scripts/harness/test.sh
new file mode 100755
index 0000000..262241a
--- /dev/null
+++ b/scripts/harness/test.sh
@@ -0,0 +1,25 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+root_dir=$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)
+cd "$root_dir"
+
+if [ -n "${HARNESS_TEST_CMD:-}" ]; then
+  eval "$HARNESS_TEST_CMD"
+  exit 0
+fi
+
+command -v uv >/dev/null 2>&1 || { echo 'ERROR: uv is required for backend tests.' >&2; exit 1; }
+command -v npm >/dev/null 2>&1 || { echo 'ERROR: npm is required for frontend tests.' >&2; exit 1; }
+
+echo '==> Backend unit tests'
+(
+  cd "$root_dir/backend"
+  GTC_LLM_ENABLED=False uv run pytest -q tests/unit/ -v --junitxml=pytest-unit-results.xml
+)
+
+echo '==> Frontend unit tests'
+(
+  cd "$root_dir/frontend"
+  npm run test:run -- --pool=threads --poolOptions.threads.singleThread
+)
diff --git a/scripts/harness/typecheck.sh b/scripts/harness/typecheck.sh
new file mode 100755
index 0000000..3461c71
--- /dev/null
+++ b/scripts/harness/typecheck.sh
@@ -0,0 +1,27 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+root_dir=$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)
+cd "$root_dir"
+
+if [ -n "${HARNESS_TYPECHECK_CMD:-}" ]; then
+  eval "$HARNESS_TYPECHECK_CMD"
+  exit 0
+fi
+
+command -v uv >/dev/null 2>&1 || { echo 'ERROR: uv is required for backend type checking.' >&2; exit 1; }
+command -v npm >/dev/null 2>&1 || { echo 'ERROR: npm is required for frontend type checking.' >&2; exit 1; }
+
+ty_output_format="${HARNESS_TY_OUTPUT_FORMAT:-concise}"
+
+echo '==> Backend typecheck'
+(
+  cd "$root_dir/backend"
+  uv run ty check app/ --output-format "$ty_output_format" --exclude app/adapters/inference/inference.py --force-exclude
+)
+
+echo '==> Frontend typecheck'
+(
+  cd "$root_dir/frontend"
+  npm run typecheck
+)
diff --git a/scripts/verify_customized.sh b/scripts/verify_customized.sh
new file mode 100755
index 0000000..1f5ab19
--- /dev/null
+++ b/scripts/verify_customized.sh
@@ -0,0 +1,165 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+# VERIFY CUSTOMIZED — Checks that template boilerplate has been replaced.
+#
+# Run after setting up harness engineering to catch leftover placeholders.
+# Exit code 0 = all customized, 1 = boilerplate detected.
+
+target_path="${1:-.}"
+target_path=$(cd "$target_path" && pwd)
+
+failures=0
+total=0
+
+pass() {
+  total=$((total + 1))
+  echo "  ✅ $1"
+}
+
+fail() {
+  total=$((total + 1))
+  failures=$((failures + 1))
+  echo "  ❌ $1"
+}
+
+check_no_placeholders() {
+  local file="$1"
+  local label="$2"
+  local full="$target_path/$file"
+
+  if [ ! -f "$full" ]; then
+    fail "$label — file missing"
+    return
+  fi
+
+  if grep -qE '<project-name>|<runtime>|<entrypoints>|_Replace:|_Replace_' "$full"; then
+    fail "$label — still contains template placeholders"
+    echo "       $(grep -nE '<project-name>|<runtime>|<entrypoints>|_Replace:|_Replace_' "$full" | head -3)"
+    return
+  fi
+
+  pass "$label"
+}
+
+check_not_empty_template() {
+  local file="$1"
+  local label="$2"
+  local min_lines="${3:-10}"
+  local full="$target_path/$file"
+
+  if [ ! -f "$full" ]; then
+    fail "$label — file missing"
+    return
+  fi
+
+  local lines
+  lines=$(wc -l < "$full" | tr -d ' ')
+  if [ "$lines" -lt "$min_lines" ]; then
+    fail "$label — only $lines lines (expected >=$min_lines, likely not customized)"
+    return
+  fi
+
+  pass "$label ($lines lines)"
+}
+
+check_plans_customized() {
+  local full="$target_path/PLANS.md"
+
+  if [ ! -f "$full" ]; then
+    echo "  ℹ️  PLANS.md — not present (optional)"
+    return
+  fi
+
+  local placeholder_count
+  placeholder_count=$(grep -cE '^[[:space:]]*-[[:space:]]*\[ \]|<describe|<list|_TBD_|_TODO_' "$full" 2>/dev/null || true)
+  placeholder_count="${placeholder_count:-0}"
+  placeholder_count=$(echo "$placeholder_count" | tr -d '[:space:]')
+  local total_lines
+  total_lines=$(wc -l < "$full" | tr -d ' ')
+
+  if [ "$placeholder_count" -gt 3 ] && [ "$total_lines" -lt 30 ]; then
+    fail "PLANS.md — appears to be raw template ($placeholder_count placeholders in $total_lines lines)"
+    return
+  fi
+
+  pass "PLANS.md customized"
+}
+
+check_smoke_is_real() {
+  local full="$target_path/scripts/harness/smoke.sh"
+
+  if [ ! -f "$full" ]; then
+    fail "smoke.sh — file missing"
+    return
+  fi
+
+  if grep -qE 'curl|wget|health|localhost|127\.0\.0\.1|uvicorn|openapi' "$full"; then
+    pass "smoke.sh — contains server lifecycle logic"
+  else
+    fail "smoke.sh — no server start/health check detected (might be a build-only stub)"
+  fi
+}
+
+check_tests_exist() {
+  local test_count=0
+
+  test_count=$((test_count + $( { find "$target_path/backend" -name 'test_*.py' -o -name '*_test.py' 2>/dev/null || true; } | wc -l | tr -d ' ')))
+  test_count=$((test_count + $( { find "$target_path/frontend" -name '*.test.ts' -o -name '*.spec.ts' -o -name '*.test.js' -o -name '*.spec.js' 2>/dev/null || true; } | wc -l | tr -d ' ')))
+  test_count=$((test_count + $( { find "$target_path" -name '*_test.go' 2>/dev/null || true; } | wc -l | tr -d ' ')))
+  test_count=$((test_count + $( { find "$target_path" -name '*Test*.cs' -o -name '*Tests*.cs' 2>/dev/null || true; } | wc -l | tr -d ' ')))
+
+  if [ "$test_count" -gt 0 ]; then
+    pass "Test files found ($test_count files)"
+  else
+    fail "No test files found"
+  fi
+}
+
+check_observability_wired() {
+  local hits=0
+
+  if [ -d "$target_path/backend/app" ]; then
+    hits=$( { grep -rlE 'GTC_HARNESS_JSONL_ENABLED|\.harness/logs|\.harness/traces|trace_id|duration_ms' "$target_path/backend/app" --include='*.py' 2>/dev/null || true; } | wc -l | tr -d ' ')
+  fi
+
+  if [ "$hits" -gt 0 ]; then
+    pass "Observability wired in application source ($hits files)"
+  else
+    fail "No harness JSONL observability wiring found in application source"
+  fi
+}
+
+echo "🔍 Verifying harness customization in: $target_path"
+echo ""
+
+echo "📄 Template Placeholders:"
+check_no_placeholders "AGENTS.md" "AGENTS.md"
+check_no_placeholders "docs/ARCHITECTURE.md" "docs/ARCHITECTURE.md"
+check_no_placeholders "docs/OBSERVABILITY.md" "docs/OBSERVABILITY.md"
+echo ""
+
+echo "📋 Content Depth:"
+check_not_empty_template "AGENTS.md" "AGENTS.md" 15
+check_not_empty_template "docs/ARCHITECTURE.md" "docs/ARCHITECTURE.md" 20
+check_not_empty_template "docs/OBSERVABILITY.md" "docs/OBSERVABILITY.md" 15
+check_plans_customized
+echo ""
+
+echo "🔧 Harness Scripts:"
+check_smoke_is_real
+check_tests_exist
+echo ""
+
+echo "📊 Observability:"
+check_observability_wired
+echo ""
+
+echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
+if [ "$failures" -gt 0 ]; then
+  echo "❌ Verification failed: $failures/$total checks failed."
+  echo "   Fix the issues above before considering the harness complete."
+  exit 1
+fi
+
+echo "✅ All $total checks passed — harness is customized and wired."
diff --git a/wireframes/agent-curation-wireframe-v2.2.html b/wireframes/agent-curation-wireframe-v2.2.html
new file mode 100644
index 0000000..26ae11f
--- /dev/null
+++ b/wireframes/agent-curation-wireframe-v2.2.html
@@ -0,0 +1,2100 @@
+<!DOCTYPE html>
+<html lang="en">
+<head>
+<meta charset="UTF-8">
+<meta name="viewport" content="width=device-width, initial-scale=1.0">
+<title>Ground Truth Curator — Agentic Curation Wireframe</title>
+<script src="https://cdn.tailwindcss.com"></script>
+<script>
+tailwind.config = {
+  theme: {
+    extend: {
+      colors: {
+        violet: {
+          50: '#f5f3ff', 100: '#ede9fe', 200: '#ddd6fe', 300: '#c4b5fd',
+          400: '#a78bfa', 500: '#8b5cf6', 600: '#7c3aed', 700: '#6d28d9',
+          800: '#5b21b6', 900: '#4c1d95',
+        }
+      }
+    }
+  }
+}
+</script>
+<style>
+  [x-cloak] { display: none !important; }
+  .line-clamp-1 { overflow: hidden; display: -webkit-box; line-clamp: 1; -webkit-line-clamp: 1; -webkit-box-orient: vertical; }
+  .line-clamp-2 { overflow: hidden; display: -webkit-box; line-clamp: 2; -webkit-line-clamp: 2; -webkit-box-orient: vertical; }
+  /* Custom scrollbar */
+  ::-webkit-scrollbar { width: 6px; }
+  ::-webkit-scrollbar-track { background: transparent; }
+  ::-webkit-scrollbar-thumb { background: #c4b5fd; border-radius: 3px; }
+
+  /* Resizable gutter for split pane */
+  .gutter { cursor: col-resize; user-select: none; transition: background 0.15s; }
+  .gutter:hover, .gutter.dragging { background: #7c3aed !important; }
+
+  /* Evidence drawer for mobile */
+  .evidence-drawer { transition: transform 0.3s cubic-bezier(0.4, 0, 0.2, 1); }
+  .evidence-drawer.closed { transform: translateX(100%); }
+
+  /* Cross-reference flash highlight */
+  @keyframes flash-highlight {
+    0% { background: #ede9fe; }
+    100% { background: transparent; }
+  }
+  .flash-ref { animation: flash-highlight 1.2s ease-out; }
+</style>
+</head>
+<body class="bg-gradient-to-b from-violet-50 via-white to-white text-slate-900 h-screen overflow-hidden">
+
+<div id="app"></div>
+
+<script>
+// ═══════════════════════════════════════════════════════════════
+// DUMMY DATA — Agentic Ground Truth Schema (ADR-045)
+// ═══════════════════════════════════════════════════════════════
+
+const AVAILABLE_TAGS = [
+  'domain:billing', 'domain:connectivity', 'domain:account', 'domain:device',
+  'intent:troubleshoot', 'intent:informational', 'intent:action-request',
+  'quality:high', 'quality:medium', 'quality:low',
+  'custom:needs-review', 'custom:edge-case',
+];
+
+function createDummyItems() {
+  return [
+    {
+      id: 'GT-0001',
+      datasetName: 'synthetic-agent-evals',
+      status: 'draft',
+      deleted: false,
+      schemaVersion: 'v3',
+      history: [
+        { role: 'user', msg: "Synthia cannot keep her learning tablet connected to the holo-wifi. She needs a clear reason and next step." },
+        { role: 'orchestrator-agent', msg: "The synthetic triage flow shows the wifi hub throttled traffic after repeated firmware retries. Recommend toggling adaptive mode and pausing background syncs." },
+        { role: 'output-agent', msg: "Status: Active. Device: HoloTab Mini 9. Connection: sandbox-wifi-42. Root cause: auto-updates saturating demo network. Resolution: enable adaptive mode + pause download queue." },
+      ],
+      groundingDataSummary: 'Fictional learner profile inside the demo sandbox where adaptive throttling activated during queued updates.',
+      evaluationCriteria: [
+        { agent: 'orchestrator-agent', contains: ['adaptive mode', 'demo network'], notContains: ['live customer data'], semanticInstructions: 'Keep explanations sandbox-only and call out throttling clearly.' },
+        { agent: 'output-agent', contains: ['Root cause', 'Resolution'], notContains: ['ticket'], semanticInstructions: 'Use concise paragraphs with neutral tone.' },
+      ],
+      manualTags: ['demo:connectivity', 'priority:medium'],
+      computedTags: ['quality:placeholder'],
+      comment: 'Synthetic QA sample for UI smoke tests.',
+      metadata: {
+        traceIds: { conversationId: 'SYNTH-CONVO-0001', sessionId: 'SYNTH-SESSION-0001' },
+        userFeedback: { issueResolved: true, feedbackText: 'Placeholder thumbs up', rating: 5 },
+        traceSource: { agentVersion: 'sandbox-bot-v0.3', environment: 'demo-sandbox' },
+      },
+      trace: {
+        id: 'SYNTH-TRACE-0001',
+        cid_list: ['SYNTH-CID-0001'],
+        uid: 'demo-user-0001',
+        impacted_device_type: 'virtual-line',
+        impacted_device: 'demo-line-0001',
+        metric_name: 'demo feedback',
+        type: 'praise',
+        comment: 'All data below is fictional for UI testing.',
+        additional_feedback: {
+          'Resolution clarity': 2,
+          'Steps reproducible': 2,
+          'Tone friendly': 2,
+          'Data sufficient': 2,
+        },
+        resolution: 'Learner resumed class after adaptive toggle.',
+        feedback_date: 1771405033,
+        feedback_datetime_utc: '2026-02-18T08:57:13+00:00',
+        chat_history: [{
+          user_query: 'Tablet keeps dropping off demo wifi.',
+          chat_response: 'Adaptive throttling engaged after queued downloads. Pause the sync queue, reboot the tablet, and resume one update at a time.',
+          context: [
+            { id: 'gt1-tool-01', run_id: 'SYNTH-RUN-0001', function_name: 'demo_signal_probe', function_arguments: "subject='demo-line-0001'", function_result: '{"signal":"stable","latencyMs":32}', execution_time: '0.42' },
+            { id: 'gt1-tool-02', run_id: 'SYNTH-RUN-0001', function_name: 'demo_plan_lookup', function_arguments: "subject='demo-line-0001'", function_result: '{"planTier":"adaptive-lite","throttleAfterMb":5120}', execution_time: '0.35' },
+            { id: 'gt1-tool-03', run_id: 'SYNTH-RUN-0001', function_name: 'demo_usage_summary', function_arguments: "window='7d'", function_result: '{"usageMb":5120,"peaks":["19:00Z"]}', execution_time: '0.18' },
+          ],
+          rca: 'Throttle triggered because queued updates saturated the sandbox router.',
+        }],
+      },
+    },
+    {
+      id: 'GT-0002',
+      datasetName: 'synthetic-agent-evals',
+      status: 'in-review',
+      deleted: false,
+      schemaVersion: 'v3',
+      history: [
+        { role: 'user', msg: "Demo resident wonders why auto-pay emailed a warning even though credits remain." },
+        { role: 'orchestrator-agent', msg: "Ledger review shows the autopay run occurred during a sandbox maintenance window; the dry-run snapshot momentarily showed zero credits even though the live pull succeeded." },
+        { role: 'output-agent', msg: "Account: Synth-Residency-447. Auto-pay pipeline: delayed < 1 min. Action: none; send reassurance explaining timing." },
+      ],
+      groundingDataSummary: 'Sandbox wallet triggered a caution email because a dry-run saw stale ledger data for forty seconds.',
+      evaluationCriteria: [
+        { agent: 'orchestrator-agent', contains: ['maintenance window'], notContains: ['escalation'], semanticInstructions: 'Highlight synthetic timing offsets only.' },
+        { agent: 'output-agent', contains: ['Action'], notContains: ['panic'], semanticInstructions: 'Deliver precise reassurance with simple status labels.' },
+      ],
+      manualTags: ['demo:billing', 'priority:low'],
+      computedTags: ['quality:placeholder'],
+      comment: 'Synthetic ledger scenario; public-safe demo data.',
+      metadata: {
+        traceIds: { conversationId: 'SYNTH-CONVO-0002', sessionId: 'SYNTH-SESSION-0002' },
+        userFeedback: { issueResolved: true, feedbackText: 'Thanks for confirming auto-pay!', rating: 5 },
+        traceSource: { agentVersion: 'sandbox-bot-v0.3', environment: 'demo-sandbox' },
+      },
+      trace: {
+        id: 'SYNTH-TRACE-0002',
+        cid_list: ['SYNTH-CID-0002'],
+        uid: 'demo-user-0002',
+        impacted_device_type: 'virtual-ledger',
+        impacted_device: 'demo-wallet-0002',
+        metric_name: 'demo feedback',
+        type: 'info',
+        comment: 'Fictional entries for UI only.',
+        additional_feedback: {
+          'Resolution clarity': 2,
+          'Steps reproducible': 1,
+          'Tone friendly': 2,
+          'Data sufficient': 2,
+        },
+        resolution: 'Explained that the warning was informational.',
+        feedback_date: 1771500000,
+        feedback_datetime_utc: '2026-02-19T08:00:00+00:00',
+        chat_history: [{
+          user_query: 'Auto-pay warning even though credits exist.',
+          chat_response: 'Maintenance window triggered a temporary warning; payment succeeded automatically.',
+          context: [
+            { id: 'gt2-tool-01', run_id: 'SYNTH-RUN-0002', function_name: 'demo_ledger_snapshot', function_arguments: "account='Synth-Residency-447'", function_result: '{"balanceCredits":120,"status":"posted"}', execution_time: '0.29' },
+            { id: 'gt2-tool-02', run_id: 'SYNTH-RUN-0002', function_name: 'demo_autopay_audit', function_arguments: "account='Synth-Residency-447'", function_result: '{"attemptTimestamp":"2026-02-19T07:59:00Z","state":"completed"}', execution_time: '0.31' },
+            { id: 'gt2-tool-03', run_id: 'SYNTH-RUN-0002', function_name: 'demo_notification_log', function_arguments: "type='warning'", function_result: '{"emailsSent":1,"reason":"maintenance-dry-run"}', execution_time: '0.22' },
+          ],
+          rca: 'Dry-run snapshot lacked the latest ledger row, so the notifier erred on the side of caution.',
+        }],
+      },
+    },
+    {
+      id: 'GT-0003',
+      datasetName: 'synthetic-agent-evals',
+      status: 'draft',
+      deleted: false,
+      schemaVersion: 'v3',
+      history: [
+        { role: 'user', msg: "Demo curator asks whether the SwitchFlex plan keeps timeline editing after migrating seats." },
+        { role: 'orchestrator-agent', msg: "Synthetic entitlement grid shows SwitchFlex retains editing but requires a nightly re-index of references before approval resumes." },
+        { role: 'output-agent', msg: "Summary: SwitchFlex keeps timeline editing. Action: schedule automated migration and review re-index report in the morning." },
+      ],
+      groundingDataSummary: 'SwitchFlex is a fictional plan used purely for UI training; migration pauses reference lookups for one job cycle.',
+      evaluationCriteria: [
+        { agent: 'orchestrator-agent', contains: ['re-index'], notContains: ['downtime'], semanticInstructions: 'Emphasize automation instead of manual lifts.' },
+        { agent: 'output-agent', contains: ['Action'], notContains: ['backlog'], semanticInstructions: 'Offer a single clear step in declarative voice.' },
+      ],
+      manualTags: ['demo:upgrade', 'priority:medium'],
+      computedTags: ['quality:placeholder'],
+      comment: 'Synthetic upgrade inquiry.',
+      metadata: {
+        traceIds: { conversationId: 'SYNTH-CONVO-0003', sessionId: 'SYNTH-SESSION-0003' },
+        userFeedback: { issueResolved: true, feedbackText: 'Great, thanks!', rating: 4 },
+        traceSource: { agentVersion: 'sandbox-bot-v0.3', environment: 'demo-sandbox' },
+      },
+      trace: {
+        id: 'SYNTH-TRACE-0003',
+        cid_list: ['SYNTH-CID-0003'],
+        uid: 'demo-user-0003',
+        impacted_device_type: 'virtual-plan',
+        impacted_device: 'demo-seat-0003',
+        metric_name: 'demo feedback',
+        type: 'info',
+        comment: 'Entire payload fabricated for design discussion.',
+        additional_feedback: {
+          'Resolution clarity': 2,
+          'Steps reproducible': 2,
+          'Tone friendly': 2,
+          'Data sufficient': 1,
+        },
+        resolution: 'Migration scheduled with reminder to verify re-index job.',
+        feedback_date: 1771600000,
+        feedback_datetime_utc: '2026-02-20T08:00:00+00:00',
+        chat_history: [{
+          user_query: 'Does SwitchFlex keep editing?',
+          chat_response: 'Yes. A nightly re-index syncs references, so expect a brief read-only window.',
+          context: [
+            { id: 'gt3-tool-01', run_id: 'SYNTH-RUN-0003', function_name: 'demo_plan_matrix', function_arguments: "plan='SwitchFlex'", function_result: '{"supportsTimelineEditing":true}', execution_time: '0.26' },
+            { id: 'gt3-tool-02', run_id: 'SYNTH-RUN-0003', function_name: 'demo_usage_heatmap', function_arguments: "plan='SwitchFlex'", function_result: '{"activeSeats":42,"inactiveSeats":3}', execution_time: '0.30' },
+            { id: 'gt3-tool-03', run_id: 'SYNTH-RUN-0003', function_name: 'demo_contract_rules', function_arguments: "plan='SwitchFlex'", function_result: '{"reindexHours":["02:00Z"]}', execution_time: '0.21' },
+          ],
+          rca: 'Re-index job temporarily locks editing, but retains access afterward.',
+        }],
+      },
+    },
+    {
+      id: 'GT-0004',
+      datasetName: 'synthetic-curation-lab',
+      status: 'draft',
+      deleted: false,
+      schemaVersion: 'v3',
+      history: [
+        { role: 'user', msg: "Placeholder supervisor wants a refund summary for a training session that was never charged." },
+        { role: 'orchestrator-agent', msg: "Sandbox ledger shows no invoice; respond with a synthetic explanation and mark as informational." },
+        { role: 'output-agent', msg: "No charge found in sandbox ledger. Action: reassure supervisor that no refund is necessary." },
+      ],
+      groundingDataSummary: 'Fictional supervisor scenario reminding curators to verify ledgers before promising refunds.',
+      evaluationCriteria: [
+        { agent: 'orchestrator-agent', contains: ['no invoice'], notContains: ['real account'], semanticInstructions: 'Keep it obviously hypothetical.' },
+      ],
+      manualTags: ['demo:refund'],
+      computedTags: [],
+      comment: 'Synthetic reminder card.',
+      metadata: {
+        traceIds: { conversationId: 'SYNTH-CONVO-0004', sessionId: 'SYNTH-SESSION-0004' },
+        userFeedback: null,
+        traceSource: { agentVersion: 'sandbox-bot-v0.3', environment: 'demo-sandbox' },
+      },
+      trace: {
+        id: 'SYNTH-TRACE-0004',
+        cid_list: ['SYNTH-CID-0004'],
+        uid: 'demo-user-0004',
+        impacted_device_type: 'virtual-ledger',
+        impacted_device: 'demo-ledger-0004',
+        metric_name: 'demo feedback',
+        type: 'note',
+        comment: 'Synthetic placeholder trace.',
+        additional_feedback: {},
+        resolution: 'Explained nothing was billed.',
+        feedback_date: 1771700000,
+        feedback_datetime_utc: '2026-02-21T08:00:00+00:00',
+        chat_history: [{
+          user_query: 'Need refund?',
+          chat_response: 'No billing event exists in the sandbox ledger, so a refund is unnecessary.',
+          context: [
+            { id: 'gt4-tool-01', run_id: 'SYNTH-RUN-0004', function_name: 'demo_invoice_lookup', function_arguments: "invoice='demo-void-0004'", function_result: '{"found":false}', execution_time: '0.19' },
+          ],
+          rca: 'Ledger empty; refund request purely informational.',
+        }],
+      },
+    },
+    {
+      id: 'GT-0005',
+      datasetName: 'synthetic-curation-lab',
+      status: 'draft',
+      deleted: true,
+      schemaVersion: 'v3',
+      history: [
+        { role: 'user', msg: "Archived example showing how deleted items appear in the UI." },
+        { role: 'orchestrator-agent', msg: "Marking for deletion to simulate moderation clean-up." },
+        { role: 'output-agent', msg: "Record hidden from assignment queue; no action required." },
+      ],
+      groundingDataSummary: 'Demonstrates UI for deleted rows.',
+      evaluationCriteria: [],
+      manualTags: [],
+      computedTags: [],
+      comment: 'Tombstone sample.',
+      metadata: {
+        traceIds: { conversationId: 'SYNTH-CONVO-0005', sessionId: 'SYNTH-SESSION-0005' },
+        userFeedback: null,
+        traceSource: { agentVersion: 'sandbox-bot-v0.3', environment: 'demo-sandbox' },
+      },
+      trace: {
+        id: 'SYNTH-TRACE-0005',
+        cid_list: ['SYNTH-CID-0005'],
+        uid: 'demo-user-0005',
+        impacted_device_type: 'virtual-line',
+        impacted_device: 'demo-line-0005',
+        metric_name: 'demo feedback',
+        type: 'note',
+        comment: 'Deleted example with minimal content.',
+        additional_feedback: {},
+        resolution: 'Record hidden.',
+        feedback_date: 1771800000,
+        feedback_datetime_utc: '2026-02-22T08:00:00+00:00',
+        chat_history: [{
+          user_query: 'Why is this deleted?',
+          chat_response: 'This sample was removed to keep the queue focused on newer artifacts.',
+          context: [],
+          rca: 'Intentional archival.',
+        }],
+      },
+    },
+    {
+      id: 'GT-0006',
+      datasetName: 'synthetic-agent-evals',
+      status: 'ready',
+      deleted: false,
+      schemaVersion: 'v3',
+      history: [
+        { role: 'user', msg: "Sample duplicate of GT-0001 to show comparison tooling." },
+        { role: 'orchestrator-agent', msg: "Sandbox duplication detected; keep entry for reviewers to practice merging decisions." },
+        { role: 'output-agent', msg: "Marked as duplicate. Action: reference GT-0001 reasoning and keep metadata aligned." },
+      ],
+      groundingDataSummary: 'Shows how duplicates inherit context while remaining obviously synthetic.',
+      evaluationCriteria: [
+        { agent: 'orchestrator-agent', contains: ['duplicate'], notContains: ['trace labels'], semanticInstructions: 'Point reviewers to the canonical record.' },
+      ],
+      manualTags: ['demo:duplicate'],
+      computedTags: ['quality:placeholder'],
+      comment: 'Duplicate of GT-0001 within sandbox.',
+      metadata: {
+        traceIds: { conversationId: 'SYNTH-CONVO-0006', sessionId: 'SYNTH-SESSION-0006' },
+        userFeedback: null,
+        traceSource: { agentVersion: 'sandbox-bot-v0.3', environment: 'demo-sandbox' },
+      },
+      trace: {
+        id: 'SYNTH-TRACE-0006',
+        cid_list: ['SYNTH-CID-0006'],
+        uid: 'demo-user-0006',
+        impacted_device_type: 'virtual-line',
+        impacted_device: 'demo-line-0006',
+        metric_name: 'demo feedback',
+        type: 'note',
+        comment: 'Duplicate practice item.',
+        additional_feedback: {},
+        resolution: 'Linked to GT-0001.',
+        feedback_date: 1771900000,
+        feedback_datetime_utc: '2026-02-23T08:00:00+00:00',
+        chat_history: [{
+          user_query: 'Why duplicate?',
+          chat_response: 'Maintains parity with canonical sandbox case for reviewer drills.',
+          context: [
+            { id: 'gt6-tool-01', run_id: 'SYNTH-RUN-0006', function_name: 'demo_duplicate_scan', function_arguments: "candidate='GT-0006'", function_result: '{"match":"GT-0001","confidence":0.98}', execution_time: '0.25' },
+            { id: 'gt6-tool-02', run_id: 'SYNTH-RUN-0006', function_name: 'demo_metadata_compare', function_arguments: "left='GT-0001' right='GT-0006'", function_result: '{"fieldsAligned":true}', execution_time: '0.27' },
+            { id: 'gt6-tool-03', run_id: 'SYNTH-RUN-0006', function_name: 'demo_resolution_log', function_arguments: "source='duplicate-lab'", function_result: '{"action":"link-only"}', execution_time: '0.19' },
+          ],
+          rca: 'Duplicate retained for reviewer practice only.',
+        }],
+      },
+    },
+  ];
+}
+
+const DEFAULT_REQUIRED_TOOL = {
+  'GT-0001': 'gt1-tool-02',
+  'GT-0002': 'gt2-tool-02',
+  'GT-0003': 'gt3-tool-02',
+  'GT-0006': 'gt6-tool-02',
+};
+
+function getTraceContextEntries(item) {
+  const trace = item && item.trace;
+  return trace && trace.chat_history && trace.chat_history[0] ? (trace.chat_history[0].context || []) : [];
+}
+
+function hydrateItem(item) {
+  item.userContextEntries = Array.isArray(item.userContextEntries)
+    ? item.userContextEntries
+    : (DEFAULT_USER_CONTEXT[item.id] ? DEFAULT_USER_CONTEXT[item.id].map(entry => ({ ...entry })) : []);
+
+  item.toolCallDecisions = item.toolCallDecisions ? { ...item.toolCallDecisions } : {};
+  const contextEntries = getTraceContextEntries(item);
+  for (const tc of contextEntries) {
+    if (!item.toolCallDecisions[tc.id]) item.toolCallDecisions[tc.id] = 'optional';
+  }
+
+  const requiredIds = Object.entries(item.toolCallDecisions)
+    .filter(([, decision]) => decision === 'required')
+    .map(([id]) => id);
+
+  if (requiredIds.length === 0 && DEFAULT_REQUIRED_TOOL[item.id]) {
+    item.toolCallDecisions[DEFAULT_REQUIRED_TOOL[item.id]] = 'required';
+  }
+
+  return item;
+}
+
+const DEFAULT_USER_CONTEXT = {
+  'GT-0001': [
+    { key: 'customer_segment', value: 'synthetic individual' },
+    { key: 'issue_type', value: 'connectivity demo' },
+  ],
+  'GT-0002': [
+    { key: 'plan_tier', value: 'sandbox autopay' },
+    { key: 'symptom_window', value: 'maintenance' },
+  ],
+  'GT-0003': [
+    { key: 'customer_intent', value: 'plan upgrade practice' },
+  ],
+};
+
+const TOOL_CALL_LAYOUTS = {
+  'GT-0001': {
+    'gt1-tool-01': { order: 1 },
+    'gt1-tool-02': { order: 2 },
+    'gt1-tool-03': { order: 3 },
+  },
+  'GT-0002': {
+    'gt2-tool-01': { order: 1 },
+    'gt2-tool-02': { order: 2 },
+    'gt2-tool-03': { order: 3 },
+  },
+  'GT-0003': {
+    'gt3-tool-01': { order: 1 },
+    'gt3-tool-02': { order: 2 },
+    'gt3-tool-03': { order: 3 },
+  },
+  'GT-0006': {
+    'gt6-tool-01': { order: 1 },
+    'gt6-tool-02': { order: 2 },
+    'gt6-tool-03': { order: 3 },
+  },
+};
+
+// ═══════════════════════════════════════════════════════════════
+// STATE
+// ═══════════════════════════════════════════════════════════════
+
+let state = {
+  items: createDummyItems().map(hydrateItem),
+  selectedId: 'GT-0001',
+  viewMode: 'curate',
+  sidebarOpen: true,
+  saving: false,
+  // Modal state
+  tagsModalOpen: false,
+  toolCallsExpanded: true,
+  // Toasts
+  toasts: [],
+  toastCounter: 0,
+  // Collapsed turns
+  collapsedTurns: new Set(),
+  // Editing turns
+  editingTurns: new Map(),
+  // Expanded tool call details
+  expandedToolCalls: new Set(),
+  // Evidence drawer for mobile split pane
+  evidenceDrawerOpen: false,
+  // Editor modal
+  editorOpen: false,
+  editorPrompt: '',
+  editorSuggestions: [],
+  editorManualMode: false,
+  editorManualTarget: 'orchestrator-agent',
+  editorManualText: '',
+};
+
+function getCurrent() {
+  return state.items.find(i => i.id === state.selectedId) || null;
+}
+
+function toast(kind, msg, duration = 3000) {
+  const id = ++state.toastCounter;
+  state.toasts.push({ id, kind, msg });
+  setTimeout(() => {
+    state.toasts = state.toasts.filter(t => t.id !== id);
+    render();
+  }, duration);
+  render();
+}
+
+// ═══════════════════════════════════════════════════════════════
+// VALIDATION
+// ═══════════════════════════════════════════════════════════════
+
+function validateHistory(history) {
+  if (!history || history.length === 0) return { valid: false, errors: ['Conversation must have at least one turn'] };
+  const errors = [];
+  if (history[0].role !== 'user') errors.push('First turn must be a user message');
+  const roles = history.map(t => t.role);
+  if (!roles.includes('output-agent')) errors.push('Missing output-agent response');
+  if (!roles.includes('orchestrator-agent')) errors.push('Missing orchestrator-agent response');
+  for (const t of history) {
+    if (!t.msg || t.msg.trim() === '') errors.push(`Empty message for ${t.role}`);
+  }
+  return { valid: errors.length === 0, errors };
+}
+
+function canApprove(item) {
+  if (!item || item.deleted) return false;
+  const hist = item.history || [];
+  const hv = validateHistory(hist);
+  if (!hv.valid) return false;
+  return validateRequiredToolSelection(item).valid;
+}
+
+function getToolCallDecision(item, callId) {
+  return item && item.toolCallDecisions && item.toolCallDecisions[callId]
+    ? item.toolCallDecisions[callId]
+    : 'optional';
+}
+
+function getOrderedToolCalls(item) {
+  return getTraceContextEntries(item)
+    .map((tc, index) => {
+      const layout = (TOOL_CALL_LAYOUTS[item.id] && TOOL_CALL_LAYOUTS[item.id][tc.id]) || {};
+      return {
+        ...tc,
+        order: layout.order || (index + 1),
+        parallelGroup: layout.parallelGroup || null,
+        originalIndex: index,
+      };
+    })
+    .sort((left, right) => left.order - right.order || left.originalIndex - right.originalIndex);
+}
+
+function validateRequiredToolSelection(item) {
+  const orderedCalls = getOrderedToolCalls(item);
+  if (orderedCalls.length === 0) {
+    return { valid: false, error: 'Trace must contain at least one tool call' };
+  }
+
+  const requiredCount = orderedCalls.filter(tc => getToolCallDecision(item, tc.id) === 'required').length;
+  if (requiredCount >= 1) return { valid: true, error: '' };
+  return { valid: false, error: 'Mark at least one tool call as required' };
+}
+
+// ═══════════════════════════════════════════════════════════════
+// ACTIONS
+// ═══════════════════════════════════════════════════════════════
+
+function selectItem(id) {
+  state.selectedId = id;
+  state.collapsedTurns = new Set();
+  state.editingTurns = new Map();
+  state.expandedToolCalls = new Set();
+  state.toolCallsExpanded = true;
+  state.evidenceDrawerOpen = false;
+  state.editorOpen = false;
+  state.editorSuggestions = [];
+  state.editorManualMode = false;
+  render();
+}
+
+function updateTurnMsg(index, msg) {
+  const item = getCurrent();
+  if (!item || !item.history || !item.history[index]) return;
+  item.history[index] = { ...item.history[index], msg };
+  render();
+}
+
+function setToolCallDecision(callId, decision) {
+  const item = getCurrent();
+  if (!item) return;
+  item.toolCallDecisions[callId] = decision;
+  render();
+}
+
+const DECISION_SEGMENTS = [
+  { value: 'required',   symbol: '★', label: '★ Required',   title: 'Needed to reach the correct answer',    activeClasses: 'bg-emerald-600 text-white shadow-sm', dotClasses: 'bg-emerald-600 text-white' },
+  { value: 'optional',   symbol: '○', label: '○ Optional',   title: 'Fine to call but not essential',         activeClasses: 'bg-sky-600 text-white shadow-sm',     dotClasses: 'bg-sky-600 text-white' },
+  { value: 'not-needed', symbol: '✕', label: '✕ Not needed',  title: 'Should not have been called',            activeClasses: 'bg-rose-600 text-white shadow-sm',    dotClasses: 'bg-rose-600 text-white' },
+];
+
+function renderMiniDecisionBadge(callId, selectedDecision) {
+  const seg = DECISION_SEGMENTS.find(s => s.value === selectedDecision) || DECISION_SEGMENTS[1]; // fall back to 'optional'
+  return h('span', {
+    className: `inline-flex items-center justify-center w-5 h-5 rounded-full text-xs font-bold ${seg.dotClasses}`,
+    title: seg.title,
+    onclick: (e) => e.stopPropagation(),
+  }, seg.symbol);
+}
+
+function renderSegmentedToggle(callId, selectedDecision) {
+  const segments = DECISION_SEGMENTS;
+
+  const wrapper = h('div', {
+    className: 'inline-flex rounded-lg border border-slate-200 bg-slate-100 p-0.5',
+    role: 'group',
+    'aria-label': 'Tool call relevance',
+  });
+
+  for (const seg of segments) {
+    const isActive = selectedDecision === seg.value;
+    wrapper.appendChild(h('button', {
+      type: 'button',
+      className: `relative select-none rounded-md px-3 py-1.5 text-xs font-semibold transition-all duration-150 ${
+        isActive
+          ? seg.activeClasses
+          : 'text-slate-500 hover:text-slate-700'
+      }`,
+      'aria-pressed': isActive ? 'true' : 'false',
+      title: seg.title,
+      onclick: () => setToolCallDecision(callId, seg.value),
+    }, seg.label));
+  }
+
+  return wrapper;
+}
+
+function addUserContextEntry() {
+  const item = getCurrent();
+  if (!item) return;
+  item.userContextEntries.push({ key: '', value: '' });
+  render();
+}
+
+function updateUserContextEntry(index, field, value) {
+  const item = getCurrent();
+  if (!item || !item.userContextEntries[index]) return;
+  item.userContextEntries[index][field] = value;
+  render();
+}
+
+function removeUserContextEntry(index) {
+  const item = getCurrent();
+  if (!item) return;
+  item.userContextEntries.splice(index, 1);
+  render();
+}
+
+function openEditor() {
+  const item = getCurrent();
+  if (!item) return;
+  const orchestratorTurn = (item.history || []).find(turn => turn.role === 'orchestrator-agent');
+  state.editorOpen = true;
+  state.editorPrompt = '';
+  state.editorSuggestions = [];
+  state.editorManualMode = false;
+  state.editorManualTarget = 'orchestrator-agent';
+  state.editorManualText = orchestratorTurn ? orchestratorTurn.msg : '';
+  render();
+}
+
+function closeEditor() {
+  state.editorOpen = false;
+  state.editorPrompt = '';
+  state.editorSuggestions = [];
+  state.editorManualMode = false;
+  state.editorManualTarget = 'orchestrator-agent';
+  state.editorManualText = '';
+  render();
+}
+
+function buildEditorSuggestions(item, prompt) {
+  const lowerPrompt = prompt.trim().toLowerCase();
+  const suggestions = [];
+  const orderedCalls = getOrderedToolCalls(item);
+
+  if (lowerPrompt.includes('required') || lowerPrompt.includes('tool')) {
+    const targetTool = orderedCalls.find(tc => /billing|outage|cellsector|comparison/i.test(tc.function_name)) || orderedCalls[0];
+    if (targetTool) {
+      suggestions.push({
+        type: 'toolDecision',
+        label: 'Required tool call',
+        callId: targetTool.id,
+        functionName: targetTool.function_name,
+        value: 'required',
+      });
+    }
+  }
+
+  if (lowerPrompt.includes('context')) {
+    suggestions.push({
+      type: 'userContext',
+      label: 'User context entry',
+      key: 'operator_note',
+      value: prompt.trim() || 'Needs curator review',
+    });
+  }
+
+  if (lowerPrompt.includes('output')) {
+    const turnIndex = (item.history || []).findIndex(turn => turn.role === 'output-agent');
+    if (turnIndex >= 0) {
+      suggestions.push({
+        type: 'turnEdit',
+        label: 'Output agent response',
+        turnIndex,
+        role: 'output-agent',
+        value: item.history[turnIndex].msg + '\n\nClarified per curator editor request.',
+      });
+    }
+  }
+
+  if (lowerPrompt.includes('orchestrator') || suggestions.length === 0) {
+    const turnIndex = (item.history || []).findIndex(turn => turn.role === 'orchestrator-agent');
+    if (turnIndex >= 0) {
+      suggestions.push({
+        type: 'turnEdit',
+        label: 'Orchestrator response',
+        turnIndex,
+        role: 'orchestrator-agent',
+        value: item.history[turnIndex].msg + '\n\nCurator edit pending confirmation.',
+      });
+    }
+  }
+
+  return suggestions;
+}
+
+function generateEditorSuggestions() {
+  const item = getCurrent();
+  if (!item) return;
+  state.editorSuggestions = buildEditorSuggestions(item, state.editorPrompt);
+  if (state.editorSuggestions.length === 0) {
+    toast('info', 'No editor suggestions generated');
+  }
+  render();
+}
+
+function updateEditorSuggestion(index, field, value) {
+  const suggestion = state.editorSuggestions[index];
+  if (!suggestion) return;
+  suggestion[field] = value;
+  render();
+}
+
+function applyEditorSuggestions() {
+  const item = getCurrent();
+  if (!item) return;
+
+  for (const suggestion of state.editorSuggestions) {
+    if (suggestion.type === 'toolDecision') {
+      setToolCallDecision(suggestion.callId, suggestion.value);
+      continue;
+    }
+
+    if (suggestion.type === 'userContext') {
+      item.userContextEntries.push({ key: suggestion.key, value: suggestion.value });
+      continue;
+    }
+
+    if (suggestion.type === 'turnEdit' && item.history[suggestion.turnIndex]) {
+      item.history[suggestion.turnIndex].msg = suggestion.value;
+    }
+  }
+
+  toast('success', 'Applied editor suggestions');
+  closeEditor();
+}
+
+function denyEditorSuggestions() {
+  state.editorSuggestions = [];
+  state.editorManualMode = false;
+  render();
+}
+
+function syncEditorManualTarget(target) {
+  const item = getCurrent();
+  state.editorManualTarget = target;
+  if (!item) {
+    state.editorManualText = '';
+    render();
+    return;
+  }
+  const turn = (item.history || []).find(entry => entry.role === target);
+  state.editorManualText = turn ? turn.msg : '';
+  render();
+}
+
+function applyManualEditorChange() {
+  const item = getCurrent();
+  if (!item) return;
+  const turnIndex = (item.history || []).findIndex(turn => turn.role === state.editorManualTarget);
+  if (turnIndex >= 0) {
+    item.history[turnIndex].msg = state.editorManualText;
+    toast('success', 'Manual edit applied');
+  }
+  render();
+}
+
+function saveDraft() {
+  const item = getCurrent();
+  if (!item) return;
+  state.saving = true;
+  render();
+  setTimeout(() => {
+    item.status = 'draft';
+    state.saving = false;
+    toast('success', `Saved ${item.id} as draft`);
+  }, 500);
+}
+
+function approveItem() {
+  const item = getCurrent();
+  if (!item || !canApprove(item)) return;
+  state.saving = true;
+  render();
+  setTimeout(() => {
+    item.status = 'approved';
+    state.saving = false;
+    toast('success', `Approved ${item.id}`);
+  }, 500);
+}
+
+function skipItem() {
+  const item = getCurrent();
+  if (!item) return;
+  item.status = 'skipped';
+  const idx = state.items.findIndex(i => i.id === item.id);
+  const next = idx < state.items.length - 1 ? state.items[idx + 1] : state.items[0];
+  state.selectedId = next.id;
+  toast('info', `Skipped ${item.id}`);
+}
+
+function deleteItem() {
+  const item = getCurrent();
+  if (!item) return;
+  item.deleted = true;
+  toast('info', `Marked ${item.id} as deleted`);
+}
+
+function restoreItem() {
+  const item = getCurrent();
+  if (!item) return;
+  item.deleted = false;
+  toast('success', `Restored ${item.id}`);
+}
+
+function duplicateItem() {
+  const item = getCurrent();
+  if (!item) return;
+  const newItem = JSON.parse(JSON.stringify(item));
+  newItem.id = 'GT-' + String(state.items.length + 1).padStart(4, '0');
+  newItem.status = 'draft';
+  newItem.deleted = false;
+  state.items.push(newItem);
+  state.selectedId = newItem.id;
+  toast('success', `Created duplicate ${newItem.id}`);
+}
+
+function updateComment(val) {
+  const item = getCurrent();
+  if (item) item.comment = val;
+  render();
+}
+
+function toggleEvidenceDrawer() {
+  state.evidenceDrawerOpen = !state.evidenceDrawerOpen;
+  render();
+}
+
+function toggleTag(tag) {
+  const item = getCurrent();
+  if (!item) return;
+  if (!item.manualTags) item.manualTags = [];
+  const idx = item.manualTags.indexOf(tag);
+  if (idx >= 0) item.manualTags.splice(idx, 1);
+  else item.manualTags.push(tag);
+  render();
+}
+
+// ═══════════════════════════════════════════════════════════════
+// RENDER HELPERS
+// ═══════════════════════════════════════════════════════════════
+
+function h(tag, attrs = {}, ...children) {
+  const el = document.createElement(tag);
+  if (tag === 'button' && !Object.prototype.hasOwnProperty.call(attrs, 'type')) {
+    el.type = 'button';
+  }
+  for (const [k, v] of Object.entries(attrs)) {
+    if (k === 'className') el.className = v;
+    else if (k === 'onclick' || k === 'onchange' || k === 'oninput' || k === 'onkeydown') el[k] = v;
+    else if (k === 'checked' || k === 'disabled' || k === 'selected') el[k] = Boolean(v);
+    else if (k === 'value') el.value = v;
+    else if (k === 'innerHTML') el.innerHTML = v;
+    else if (k === 'style' && typeof v === 'object') Object.assign(el.style, v);
+    else el.setAttribute(k, v);
+  }
+  for (const c of children) {
+    if (typeof c === 'string' || typeof c === 'number') el.appendChild(document.createTextNode(String(c)));
+    else if (c) el.appendChild(c);
+  }
+  return el;
+}
+
+function statusBadge(status, deleted) {
+  const colors = {
+    draft: 'bg-amber-100 text-amber-900',
+    approved: 'bg-emerald-100 text-emerald-900',
+    skipped: 'bg-slate-200 text-slate-800',
+    deleted: 'bg-rose-100 text-rose-900',
+  };
+  const frag = document.createDocumentFragment();
+  frag.appendChild(h('span', { className: `rounded-full px-2 py-0.5 text-xs font-medium ${colors[status] || ''}` }, status));
+  if (deleted && status !== 'deleted') {
+    frag.appendChild(h('span', { className: 'ml-1 rounded-full bg-rose-100 px-2 py-0.5 text-xs font-medium text-rose-900' }, 'deleted'));
+  }
+  return frag;
+}
+
+function roleBadge(role) {
+  const config = {
+    'user': { label: 'User', bg: 'bg-blue-500 text-white' },
+    'output-agent': { label: 'Output Agent', bg: 'bg-amber-500 text-white' },
+    'orchestrator-agent': { label: 'Orchestrator Agent', bg: 'bg-violet-500 text-white' },
+  };
+  const c = config[role] || { label: role, bg: 'bg-slate-500 text-white' };
+  return h('span', { className: `rounded-full px-3 py-1 text-xs font-medium ${c.bg}` }, c.label);
+}
+
+function renderMarkdown(text) {
+  if (!text) return '<span class="text-slate-400 italic">(empty)</span>';
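+  // Escape HTML first (the result is assigned via innerHTML), then process line-by-line for block-level elements
+  const lines = text.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;').split('\n');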
+  const result = [];
+  for (const line of lines) {
+    if (/^### (.+)$/.test(line)) {
+      result.push(line.replace(/^### (.+)$/, '<h3 class="text-sm font-bold text-slate-800 mt-3 mb-1">$1</h3>'));
+    } else if (/^## (.+)$/.test(line)) {
+      result.push(line.replace(/^## (.+)$/, '<h2 class="text-base font-bold text-slate-800 mt-3 mb-1">$1</h2>'));
+    } else if (/^# (.+)$/.test(line)) {
+      result.push(line.replace(/^# (.+)$/, '<h1 class="text-lg font-bold text-slate-800 mt-4 mb-1">$1</h1>'));
+    } else if (/^- (.+)$/.test(line)) {
+      result.push(line.replace(/^- (.+)$/, '<div class="ml-4 flex items-start gap-1.5"><span class="text-slate-400 mt-0.5">•</span><span>$1</span></div>'));
+    } else if (/^\d+\. (.+)$/.test(line)) {
+      result.push(line.replace(/^(\d+)\. (.+)$/, '<div class="ml-4 flex items-start gap-1.5"><span class="font-mono text-slate-500 shrink-0">$1.</span><span>$2</span></div>'));
+    } else {
+      result.push(line || '<br>');
+    }
+  }
+  return result.join('\n')
+    .replace(/\*\*(.*?)\*\*/g, '<strong>$1</strong>')
+    .replace(/`([^`]+)`/g, '<code class="rounded bg-slate-200 px-1 py-0.5 text-xs font-mono">$1</code>')
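+    .replace(/<[^>]+>|"([^"<]+)"/g, (m, quoted) => (quoted === undefined ? m : `&ldquo;${quoted}&rdquo;`)); // curl quotes in text only; generated tags pass through so class attributes stay intact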
+}
+
+// ═══════════════════════════════════════════════════════════════
+// RENDER FUNCTIONS
+// ═══════════════════════════════════════════════════════════════
+
+function renderHeader() {
+  const header = h('header', { className: 'sticky top-0 z-10 border-b bg-white/80 backdrop-blur' });
+  const inner = h('div', { className: 'mx-auto flex w-full flex-wrap items-center gap-3 px-4 py-3' });
+
+  inner.appendChild(h('div', { className: 'text-xl font-semibold' }, 'Ground Truth Curator ',
+    h('span', { className: 'ml-2 rounded-full bg-amber-100 px-2 py-0.5 text-xs font-medium text-amber-900' }, 'AGENTIC WIREFRAME')
+  ));
+
+  const btns = h('div', { className: 'ml-auto flex flex-wrap items-center gap-2' });
+
+  // Evidence drawer toggle (mobile only)
+  if (window.innerWidth < 1024) {
+    btns.appendChild(h('button', {
+      className: 'rounded-xl border border-violet-300 px-3 py-1.5 text-sm font-medium text-violet-700 hover:bg-violet-50',
+      onclick: toggleEvidenceDrawer
+    }, '📋 Evidence'));
+  }
+
+  btns.appendChild(h('button', {
+    className: 'rounded-xl border px-3 py-1.5 text-sm hover:bg-violet-50',
+    onclick: () => { state.sidebarOpen = !state.sidebarOpen; render(); }
+  }, state.sidebarOpen ? 'Hide Sidebar' : 'Show Sidebar'));
+
+  btns.appendChild(h('button', {
+    className: 'rounded-xl border px-3 py-1.5 text-sm hover:bg-violet-50',
+    onclick: () => { state.viewMode = state.viewMode === 'curate' ? 'explorer' : 'curate'; render(); }
+  }, state.viewMode === 'curate' ? 'Explorer' : 'Back to Curation'));
+
+  btns.appendChild(h('button', {
+    className: 'rounded-xl border px-3 py-1.5 text-sm hover:bg-violet-50',
+    onclick: () => { state.viewMode = 'stats'; render(); }
+  }, 'Stats'));
+
+  inner.appendChild(btns);
+  header.appendChild(inner);
+  return header;
+}
+
+function renderQueueSidebar() {
+  const aside = h('aside', { className: 'self-start h-[calc(100vh-5.5rem)] rounded-2xl border bg-white p-3 shadow-sm flex flex-col overflow-hidden col-span-3' });
+
+  const headerRow = h('div', { className: 'mb-2 flex items-center justify-between' });
+  headerRow.appendChild(h('div', { className: 'font-medium' }, 'Queue'));
+  headerRow.appendChild(h('button', {
+    className: 'rounded-lg border px-2 py-1 text-xs hover:bg-violet-50',
+    onclick: () => toast('info', 'Refreshed queue')
+  }, '↻ Refresh'));
+  aside.appendChild(headerRow);
+
+  const list = h('div', { className: 'flex-1 min-h-0 overflow-auto pr-1 space-y-2' });
+
+  for (const item of state.items) {
+    const isSelected = state.selectedId === item.id;
+    const userMsg = (item.history || []).find(t => t.role === 'user');
+    const preview = userMsg ? userMsg.msg : '(no user message)';
+
+    const card = h('div', {
+      className: `w-full rounded-xl border p-3 text-left cursor-pointer hover:bg-violet-50 transition-colors ${isSelected ? 'border-violet-400 bg-violet-50' : ''} ${item.deleted ? 'opacity-60' : ''}`,
+      onclick: () => selectItem(item.id)
+    });
+
+    const top = h('div', { className: 'flex items-center justify-between gap-2' });
+    const left = h('div', { className: 'font-medium text-sm flex items-center gap-2 line-clamp-1' });
+    left.appendChild(h('span', {}, item.id));
+    const catTag = (item.computedTags || []).find(t => t.startsWith('category:'));
+    if (catTag) {
+      left.appendChild(h('span', { className: 'rounded-full bg-slate-100 px-2 py-0.5 text-xs text-slate-800' }, catTag.replace('category:', '')));
+    }
+    if (item.deleted) {
+      left.appendChild(h('span', { className: 'rounded-full bg-rose-100 px-2 py-0.5 text-xs text-rose-900' }, 'deleted'));
+    }
+    top.appendChild(left);
+
+    const badge = h('span', {});
+    badge.appendChild(statusBadge(item.status, false));
+    top.appendChild(badge);
+    card.appendChild(top);
+
+    card.appendChild(h('div', { className: 'truncate text-xs text-slate-600 mt-1' }, preview));
+    list.appendChild(card);
+  }
+
+  aside.appendChild(list);
+
+  aside.appendChild(h('div', { className: 'pt-2 mt-2 border-t' },
+    h('button', {
+      className: 'w-full rounded-xl border border-violet-300 bg-violet-600 px-3 py-2 text-sm text-white shadow hover:bg-violet-700',
+      onclick: () => toast('info', 'Self-serve: would request more items')
+    }, 'Request More (Self-serve)')
+  ));
+
+  return aside;
+}
+
+// ── Conversation Turn ──────────────────────────────────────────
+
+function renderConversationTurn(turn, index, item) {
+  const isUser = turn.role === 'user';
+  const isCollapsed = state.collapsedTurns.has(index);
+  const isEditing = state.editingTurns.has(index);
+
+  const borderColor = {
+    'user': 'border-blue-200 bg-blue-50',
+    'output-agent': 'border-amber-200 bg-amber-50',
+    'orchestrator-agent': 'border-violet-200 bg-violet-50',
+  }[turn.role] || 'border-slate-200 bg-slate-50';
+
+  const div = h('div', { className: `mb-3 rounded-xl border p-4 ${borderColor}` });
+
+  // Header row
+  const header = h('div', { className: 'mb-2 flex items-center justify-between flex-wrap gap-2' });
+  const leftGroup = h('div', { className: 'flex items-center gap-2 flex-wrap' });
+
+  leftGroup.appendChild(roleBadge(turn.role));
+
+  // Collapse/Expand
+  leftGroup.appendChild(h('button', {
+    className: 'ml-1 flex items-center gap-1 rounded-lg border border-slate-200 px-2 py-1 text-xs font-medium text-slate-600 hover:bg-slate-50',
+    onclick: () => { isCollapsed ? state.collapsedTurns.delete(index) : state.collapsedTurns.add(index); render(); }
+  }, isCollapsed ? '▸ Expand' : '▾ Collapse'));
+
+  header.appendChild(leftGroup);
+
+  // Edit buttons
+  const rightGroup = h('div', { className: 'flex items-center gap-1' });
+  if (!isEditing && !item.deleted) {
+    rightGroup.appendChild(h('button', {
+      className: 'flex items-center gap-1 rounded-lg border border-slate-200 px-2 py-1 text-xs font-medium text-slate-700 hover:bg-white',
+      onclick: () => { state.editingTurns.set(index, turn.msg); render(); }
+    }, '✏️ Edit'));
+  }
+  if (isEditing) {
+    rightGroup.appendChild(h('button', {
+      className: 'flex items-center gap-1 rounded-lg border border-emerald-200 px-2 py-1 text-xs font-medium text-emerald-700 hover:bg-emerald-50',
+      onclick: () => { updateTurnMsg(index, state.editingTurns.get(index)); state.editingTurns.delete(index); render(); }
+    }, '✓ Save'));
+    rightGroup.appendChild(h('button', {
+      className: 'flex items-center gap-1 rounded-lg border border-slate-200 px-2 py-1 text-xs font-medium text-slate-700 hover:bg-slate-50',
+      onclick: () => { state.editingTurns.delete(index); render(); }
+    }, '✕ Cancel'));
+  }
+  header.appendChild(rightGroup);
+  div.appendChild(header);
+
+  // Content (collapsible)
+  if (!isCollapsed) {
+    if (isEditing) {
+      const ta = h('textarea', {
+        className: 'w-full rounded-lg border border-slate-300 p-3 text-sm focus:outline-none focus:ring-2 focus:ring-violet-300',
+        rows: '6',
+        oninput: (e) => state.editingTurns.set(index, e.target.value)
+      });
+      ta.value = state.editingTurns.get(index) || '';
+      div.appendChild(ta);
+    } else {
+      div.appendChild(h('div', { className: 'rounded-lg px-1 py-0.5 text-sm', innerHTML: renderMarkdown(turn.msg) }));
+    }
+  }
+
+  return div;
+}
+
+// ── Trace Data Panel ───────────────────────────────────────────
+
+function renderTracePanel(item) {
+  const container = h('div', { className: 'rounded-2xl border bg-white shadow-sm' });
+  const trace = item.trace;
+  const contextEntries = getOrderedToolCalls(item);
+
+  // Header (always visible, acts as toggle)
+  const header = h('div', {
+    className: 'flex items-center justify-between p-4 cursor-pointer hover:bg-slate-50 rounded-2xl select-none',
+    onclick: () => { state.toolCallsExpanded = !state.toolCallsExpanded; render(); }
+  });
+
+  const headerLeft = h('div', { className: 'flex items-center gap-2 flex-wrap' });
+  headerLeft.appendChild(h('span', { className: 'text-sm font-medium text-slate-700' },
+    `📋 Trace Data (${contextEntries.length} tool calls)`
+  ));
+  if (trace) {
+    headerLeft.appendChild(h('span', {
+      className: `rounded-full px-2 py-0.5 text-xs font-medium ${trace.type === 'like' ? 'bg-emerald-100 text-emerald-800' : 'bg-rose-100 text-rose-800'}`
+    }, trace.type === 'like' ? '👍 Positive' : '👎 Negative'));
+  }
+  header.appendChild(headerLeft);
+  header.appendChild(h('span', { className: 'text-xs text-slate-500' }, state.toolCallsExpanded ? '▾ Collapse' : '▸ Expand'));
+  container.appendChild(header);
+
+  if (!state.toolCallsExpanded) return container;
+
+  const content = h('div', { className: 'border-t px-4 pb-4 space-y-3' });
+
+  if (!trace) {
+    content.appendChild(h('div', { className: 'rounded-xl border border-slate-200 bg-slate-50 p-4 text-center mt-3' },
+      h('p', { className: 'text-sm text-slate-600' }, 'No trace data available for this ground truth.')
+    ));
+    container.appendChild(content);
+    return container;
+  }
+
+  // Trace metadata summary
+  const metaDiv = h('div', { className: 'mt-3 rounded-xl border border-slate-200 bg-slate-50 p-3 space-y-1' });
+  metaDiv.appendChild(h('div', { className: 'text-xs font-semibold text-slate-600 uppercase tracking-wide mb-1' }, 'Trace Info'));
+  const metaGrid = h('div', { className: 'grid grid-cols-2 gap-x-4 gap-y-1 text-xs' });
+  metaGrid.appendChild(h('div', {},
+    h('span', { className: 'font-mono text-slate-500' }, 'trace_id: '),
+    h('span', { className: 'text-slate-700 font-mono' }, trace.id.length > 18 ? trace.id.substring(0, 18) + '…' : trace.id)
+  ));
+  metaGrid.appendChild(h('div', {},
+    h('span', { className: 'font-mono text-slate-500' }, 'device: '),
+    h('span', { className: 'text-slate-700' }, `${trace.impacted_device_type} ${trace.impacted_device}`)
+  ));
+  metaGrid.appendChild(h('div', {},
+    h('span', { className: 'font-mono text-slate-500' }, 'feedback: '),
+    h('span', { className: 'text-slate-700' }, trace.type)
+  ));
+  metaGrid.appendChild(h('div', {},
+    h('span', { className: 'font-mono text-slate-500' }, 'date: '),
+    h('span', { className: 'text-slate-700' }, trace.feedback_datetime_utc)
+  ));
+  if (trace.resolution) {
+    const resDiv = h('div', { className: 'col-span-2' });
+    resDiv.appendChild(h('span', { className: 'font-mono text-slate-500' }, 'resolution: '));
+    resDiv.appendChild(h('span', { className: 'text-slate-700' }, trace.resolution));
+    metaGrid.appendChild(resDiv);
+  }
+  metaDiv.appendChild(metaGrid);
+  content.appendChild(metaDiv);
+
+  // Additional feedback scores
+  if (trace.additional_feedback && Object.keys(trace.additional_feedback).length > 0) {
+    const fbDiv = h('div', { className: 'rounded-xl border border-slate-200 bg-slate-50 p-3' });
+    fbDiv.appendChild(h('div', { className: 'text-xs font-semibold text-slate-600 uppercase tracking-wide mb-1' }, 'Feedback Scores'));
+    for (const [question, score] of Object.entries(trace.additional_feedback)) {
+      const scoreColor = score <= 1 ? 'text-emerald-700' : score <= 2 ? 'text-amber-700' : 'text-rose-700';
+      fbDiv.appendChild(h('div', { className: 'flex items-center justify-between text-xs py-0.5' },
+        h('span', { className: 'text-slate-600 mr-2' }, question),
+        h('span', { className: `font-medium ${scoreColor}` }, String(score))
+      ));
+    }
+    fbDiv.appendChild(h('p', { className: 'mt-1 text-xs text-slate-400 italic' }, 'Scale: 1 = Strongly Agree, 5 = Strongly Disagree'));
+    content.appendChild(fbDiv);
+  }
+
+  // Tool calls from context
+  if (contextEntries.length === 0) {
+    content.appendChild(h('div', { className: 'rounded-xl border border-slate-200 bg-slate-50 p-4 text-center' },
+      h('p', { className: 'text-sm text-slate-600' }, 'No tool calls in this trace.')
+    ));
+  } else {
+    content.appendChild(h('div', { className: 'text-xs font-semibold text-slate-600 uppercase tracking-wide mt-2' }, `Tool Calls (${contextEntries.length})`));
+
+    for (const tc of contextEntries) {
+      const isExpanded = state.expandedToolCalls.has(tc.id);
+      const tcDiv = h('div', { className: 'mt-2 rounded-xl border border-slate-200 bg-white' });
+
+      // Tool call header — CSS Grid for consistent column alignment across rows
+      const selectedDecision = getToolCallDecision(item, tc.id);
+      const tcHeader = h('div', {
+        className: 'grid items-center p-3 cursor-pointer hover:bg-slate-50/50 rounded-xl gap-x-2',
+        style: 'grid-template-columns: 2rem minmax(0,1fr) 3.5rem 4rem 1.25rem 0.75rem;',
+        onclick: () => { isExpanded ? state.expandedToolCalls.delete(tc.id) : state.expandedToolCalls.add(tc.id); render(); }
+      });
+      tcHeader.appendChild(h('span', { className: 'rounded-full bg-violet-100 px-2 py-0.5 text-xs font-semibold text-violet-800 text-center' }, `#${tc.order}`));
+      tcHeader.appendChild(h('span', { className: 'rounded-lg bg-slate-700 px-2 py-0.5 text-xs font-mono text-white truncate', title: tc.function_name }, tc.function_name));
+      tcHeader.appendChild(tc.parallelGroup
+        ? h('span', { className: 'text-xs font-medium text-amber-600 text-center whitespace-nowrap' }, `‖ ${tc.parallelGroup}`)
+        : h('span'));
+      tcHeader.appendChild(h('span', { className: 'text-xs text-slate-400 text-right whitespace-nowrap tabular-nums' }, `${tc.execution_time}s`));
+      tcHeader.appendChild(renderMiniDecisionBadge(tc.id, selectedDecision));
+      tcHeader.appendChild(h('span', { className: 'text-xs text-slate-400 text-center' }, isExpanded ? '▾' : '▸'));
+      tcDiv.appendChild(tcHeader);
+
+      // Expanded detail
+      if (isExpanded) {
+        const decisionRow = h('div', { className: 'border-t border-slate-100 px-3 py-3 text-xs text-slate-700' });
+        decisionRow.appendChild(h('div', { className: 'mb-2 font-semibold uppercase tracking-wide text-slate-500' }, 'Was this call needed for the correct answer?'));
+        decisionRow.appendChild(renderSegmentedToggle(tc.id, selectedDecision));
+
+        const detail = h('div', { className: 'border-t p-3 space-y-2' });
+
+        // Arguments
+        detail.appendChild(h('div', { className: 'text-xs font-semibold text-slate-600 uppercase tracking-wide' }, 'Arguments'));
+        const argsPre = h('pre', { className: 'rounded-lg bg-slate-800 p-3 text-xs text-green-400 overflow-x-auto whitespace-pre-wrap' });
+        argsPre.textContent = tc.function_arguments;
+        detail.appendChild(argsPre);
+
+        // Result
+        detail.appendChild(h('div', { className: 'text-xs font-semibold text-slate-600 uppercase tracking-wide mt-2' }, 'Result'));
+        let resultText = tc.function_result;
+        try { resultText = JSON.stringify(JSON.parse(tc.function_result), null, 2); } catch(e) { /* use raw */ }
+        const resultPre = h('pre', { className: 'rounded-lg bg-slate-800 p-3 text-xs text-green-400 overflow-x-auto max-h-60 overflow-y-auto' });
+        resultPre.textContent = resultText;
+        detail.appendChild(resultPre);
+
+        tcDiv.appendChild(decisionRow);
+        tcDiv.appendChild(detail);
+      }
+
+      content.appendChild(tcDiv);
+    }
+  }
+
+  container.appendChild(content);
+  return container;
+}
+
+// ── Removed Legacy Criteria Pane ───────────────────────────────
+
+function renderEvaluationCriteria(item) {
+  return null;
+}
+
+// ── Conversation + Grounding ───────────────────────────────────
+
+function renderConversationEditor(item) {
+  const container = h('div', { className: 'rounded-2xl border bg-white p-4 shadow-sm' });
+  const history = item.history || [];
+
+  // Header
+  const headerRow = h('div', { className: 'mb-2 flex items-center justify-between' });
+  headerRow.appendChild(h('div', { className: 'text-sm font-medium text-slate-700' }, `Conversation (${history.length} turns)`));
+
+  const headerActions = h('div', { className: 'flex items-center gap-2' });
+  headerActions.appendChild(h('button', {
+    className: 'rounded-lg border border-violet-300 bg-violet-50 px-3 py-1.5 text-xs font-medium text-violet-700 hover:bg-violet-100',
+    onclick: openEditor,
+  }, 'Editor'));
+
+  if (history.length > 0) {
+    const hv = validateHistory(history);
+    const validBadge = h('div', { className: 'flex items-center gap-1.5 text-xs' });
+    if (hv.valid) {
+      validBadge.appendChild(h('span', { className: 'text-emerald-700' }, '✓ Valid'));
+    } else {
+      validBadge.appendChild(h('span', { className: 'text-amber-700', title: hv.errors.join('; ') }, `⚠ ${hv.errors.length} issue${hv.errors.length !== 1 ? 's' : ''}`));
+    }
+    headerActions.appendChild(validBadge);
+  }
+  headerRow.appendChild(headerActions);
+  container.appendChild(headerRow);
+
+  // User context key/value entries
+  const contextDiv = h('div', { className: 'mb-4 border-b border-slate-200 pb-4' });
+  const contextHeader = h('div', { className: 'mb-2 flex items-center justify-between gap-3' });
+  contextHeader.appendChild(h('div', {},
+    h('div', { className: 'text-sm font-medium text-slate-700' }, 'User Context'),
+    h('p', { className: 'text-xs text-slate-500 mt-1' }, 'Attach simple key/value context to this ground truth for later review or prompt construction.')
+  ));
+  contextHeader.appendChild(h('button', {
+    className: 'rounded-lg border border-violet-300 px-3 py-1.5 text-xs font-medium text-violet-700 hover:bg-violet-50',
+    onclick: addUserContextEntry,
+  }, '+ Add Entry'));
+  contextDiv.appendChild(contextHeader);
+
+  if (!item.userContextEntries.length) {
+    contextDiv.appendChild(h('div', { className: 'rounded-xl border border-slate-200 bg-slate-50 p-4 text-sm text-slate-500' }, 'No user context added yet.'));
+  } else {
+    const contextList = h('div', { className: 'space-y-2' });
+    for (let i = 0; i < item.userContextEntries.length; i++) {
+      const entry = item.userContextEntries[i];
+      const row = h('div', { className: 'grid grid-cols-1 gap-2 rounded-xl border border-slate-200 p-3 md:grid-cols-[1fr,1.5fr,auto]' });
+
+      row.appendChild(h('input', {
+        className: 'rounded-lg border border-slate-300 px-3 py-2 text-sm focus:outline-none focus:ring-2 focus:ring-violet-300',
+        value: entry.key,
+        placeholder: 'key',
+        oninput: (e) => updateUserContextEntry(i, 'key', e.target.value),
+      }));
+
+      row.appendChild(h('input', {
+        className: 'rounded-lg border border-slate-300 px-3 py-2 text-sm focus:outline-none focus:ring-2 focus:ring-violet-300',
+        value: entry.value,
+        placeholder: 'value',
+        oninput: (e) => updateUserContextEntry(i, 'value', e.target.value),
+      }));
+
+      row.appendChild(h('button', {
+        className: 'rounded-lg border border-rose-200 px-3 py-2 text-xs font-medium text-rose-700 hover:bg-rose-50',
+        onclick: () => removeUserContextEntry(i),
+      }, 'Remove'));
+
+      contextList.appendChild(row);
+    }
+    contextDiv.appendChild(contextList);
+  }
+  container.appendChild(contextDiv);
+
+  // Turns
+  if (history.length === 0) {
+    container.appendChild(h('div', { className: 'rounded-xl border border-slate-200 bg-slate-50 p-6 text-center' },
+      h('p', { className: 'text-sm text-slate-600' }, 'No conversation history. This item may need trace re-ingestion.')
+    ));
+  } else {
+    for (let i = 0; i < history.length; i++) {
+      container.appendChild(renderConversationTurn(history[i], i, item));
+    }
+  }
+
+  // Tags section
+  const tagsSection = h('div', { className: 'mt-4 border-t border-slate-200 pt-4' });
+  const tagsHeader = h('div', { className: 'mb-2 flex items-center justify-between' });
+  tagsHeader.appendChild(h('div', { className: 'text-sm font-medium text-slate-700' }, 'Tags'));
+  tagsHeader.appendChild(h('button', {
+    className: 'rounded-lg border border-amber-300 bg-white px-3 py-1.5 text-xs font-medium text-amber-700 hover:bg-amber-50',
+    onclick: () => { state.tagsModalOpen = true; render(); }
+  }, 'Manage Tags'));
+  tagsSection.appendChild(tagsHeader);
+
+  const tagsWrap = h('div', { className: 'flex flex-wrap gap-2' });
+  if (item.computedTags) {
+    for (const tag of item.computedTags) {
+      tagsWrap.appendChild(h('span', { className: 'inline-flex items-center gap-1 rounded-full bg-slate-100 border border-slate-200 px-2 py-0.5 text-xs text-slate-600' }, '🔒 ' + tag));
+    }
+  }
+  if (item.manualTags) {
+    for (const tag of item.manualTags) {
+      tagsWrap.appendChild(h('span', { className: 'rounded-full bg-violet-100 px-2 py-0.5 text-xs text-violet-800' }, tag));
+    }
+  }
+  if ((!item.computedTags || item.computedTags.length === 0) && (!item.manualTags || item.manualTags.length === 0)) {
+    tagsWrap.appendChild(h('p', { className: 'text-sm text-slate-500' }, 'No tags added yet'));
+  }
+  tagsSection.appendChild(tagsWrap);
+  container.appendChild(tagsSection);
+
+  return container;
+}
+
+// ── Metadata Panel ─────────────────────────────────────────────
+
+function renderMetadataPanel(item) {
+  const container = h('details', { className: 'rounded-2xl border bg-white shadow-sm' });
+  container.appendChild(h('summary', { className: 'p-4 cursor-pointer text-sm font-medium select-none' }, 'Metadata & Trace Info'));
+
+  const content = h('div', { className: 'px-4 pb-4 space-y-3' });
+  const meta = item.metadata || {};
+
+  // Trace IDs
+  if (meta.traceIds) {
+    const traceDiv = h('div', { className: 'rounded-lg border border-slate-200 bg-slate-50 p-3' });
+    traceDiv.appendChild(h('div', { className: 'text-xs font-semibold text-slate-600 mb-1' }, 'Trace IDs'));
+    for (const [k, v] of Object.entries(meta.traceIds)) {
+      traceDiv.appendChild(h('div', { className: 'text-xs text-slate-700' },
+        h('span', { className: 'font-mono text-slate-500' }, `${k}: `),
+        h('span', {}, v)
+      ));
+    }
+    content.appendChild(traceDiv);
+  }
+
+  // Trace Source
+  if (meta.traceSource) {
+    const srcDiv = h('div', { className: 'rounded-lg border border-slate-200 bg-slate-50 p-3' });
+    srcDiv.appendChild(h('div', { className: 'text-xs font-semibold text-slate-600 mb-1' }, 'Trace Source'));
+    for (const [k, v] of Object.entries(meta.traceSource)) {
+      srcDiv.appendChild(h('div', { className: 'text-xs text-slate-700' },
+        h('span', { className: 'font-mono text-slate-500' }, `${k}: `),
+        h('span', {}, v)
+      ));
+    }
+    content.appendChild(srcDiv);
+  }
+
+  // User Feedback
+  if (meta.userFeedback) {
+    const fb = meta.userFeedback;
+    const resolvedColor = fb.issueResolved ? 'border-emerald-200 bg-emerald-50' : 'border-amber-200 bg-amber-50';
+    const fbDiv = h('div', { className: `rounded-lg border p-3 ${resolvedColor}` });
+    fbDiv.appendChild(h('div', { className: 'text-xs font-semibold text-slate-600 mb-1' }, 'User Feedback (from care employee)'));
+    fbDiv.appendChild(h('div', { className: 'flex items-center gap-3 text-xs' },
+      h('span', { className: fb.issueResolved ? 'text-emerald-800' : 'text-amber-800' }, fb.issueResolved ? '✓ Resolved' : '✗ Unresolved'),
+      fb.rating ? h('span', { className: 'text-slate-600' }, '★'.repeat(fb.rating) + '☆'.repeat(5 - fb.rating)) : null
+    ));
+    if (fb.feedbackText) {
+      fbDiv.appendChild(h('p', { className: 'mt-1 text-xs text-slate-700 italic' }, `"${fb.feedbackText}"`));
+    }
+    content.appendChild(fbDiv);
+  } else {
+    content.appendChild(h('div', { className: 'rounded-lg border border-slate-200 bg-slate-50 p-3 text-xs text-slate-500 italic' }, 'No user feedback available.'));
+  }
+
+
+
+  container.appendChild(content);
+  return container;
+}
+
+// ── Main Curate Pane ───────────────────────────────────────────
+
+function renderCuratePane(item) {
+  const isNarrow = window.innerWidth < 1024;
+  const container = h('div', { className: 'flex h-[calc(100vh-5.5rem)] min-h-0' });
+
+  if (!item) {
+    container.appendChild(h('div', { className: 'flex h-full w-full items-center justify-center text-slate-400' }, 'Select an item from the queue'));
+    return container;
+  }
+
+  // ── LEFT PANE: Conversation & Curation ──
+  const leftPane = h('div', {
+    id: 'conversation-pane',
+    className: 'min-w-0 overflow-y-auto space-y-3 p-1',
+    style: { flex: '5' }
+  });
+
+  // Deleted banner
+  if (item.deleted) {
+    leftPane.appendChild(h('div', { className: 'rounded-2xl border border-rose-200 bg-rose-50 p-3 text-sm text-rose-900' },
+      'This ground truth is marked as deleted. You can restore it or leave it deleted.'
+    ));
+  }
+
+  // Conversation editor (includes user context and tags)
+  leftPane.appendChild(renderConversationEditor(item));
+
+  // Comments
+  const commentDiv = h('div', { className: 'rounded-2xl border bg-white p-4 shadow-sm' });
+  const commentHeader = h('div', { className: 'mb-1 flex items-center gap-2' });
+  commentHeader.appendChild(h('div', { className: 'text-sm font-medium' }, 'Curator Notes'));
+  commentHeader.appendChild(h('span', { className: 'ml-1 rounded-full border px-2 py-0.5 text-xs text-slate-500' }, 'Optional'));
+  commentDiv.appendChild(commentHeader);
+  commentDiv.appendChild(h('p', { className: 'mb-2 text-xs text-slate-500 leading-relaxed' },
+    'As an SME reviewing this ground truth, use this space to document your reasoning — for example: why you edited an agent response, what domain knowledge informed your corrections, caveats about edge cases or data quality, whether the trace is representative of typical scenarios, or flags for other curators to review.'
+  ));
+  const commentTa = h('textarea', {
+    className: 'h-20 w-full resize-y rounded-xl border p-3 focus:outline-none focus:ring-2 focus:ring-violet-300 text-sm',
+    placeholder: 'e.g. "Corrected orchestrator RCA — original missed that roaming on 2G/3G explains the slow speeds, not just plan overage. Verified against SOC B25D200T1 provisioning rules."',
+    oninput: (e) => updateComment(e.target.value)
+  });
+  commentTa.value = item.comment || '';
+  commentDiv.appendChild(commentTa);
+  leftPane.appendChild(commentDiv);
+
+  // Approval status
+  const approved = canApprove(item);
+  if (!approved) {
+    const warnDiv = h('div', { className: 'rounded-2xl border border-amber-200 bg-amber-50 p-4' });
+    warnDiv.appendChild(h('h3', { className: 'mb-2 text-sm font-semibold text-amber-900' }, '⚠️ Issues Preventing Approval'));
+    const issues = h('div', { className: 'space-y-2 text-sm text-amber-800' });
+
+    if (item.deleted) {
+      issues.appendChild(h('p', {}, '✗ Item is deleted: Restore before approving'));
+    }
+
+    const hv = validateHistory(item.history || []);
+    if (!hv.valid) {
+      for (const err of hv.errors) {
+        issues.appendChild(h('p', {}, `✗ ${err}`));
+      }
+    }
+
+    const toolValidation = validateRequiredToolSelection(item);
+    if (!toolValidation.valid) {
+      issues.appendChild(h('p', {}, `✗ ${toolValidation.error}`));
+    }
+
+    warnDiv.appendChild(issues);
+    leftPane.appendChild(warnDiv);
+  } else {
+    leftPane.appendChild(h('div', { className: 'rounded-2xl border border-emerald-200 bg-emerald-50 p-4' },
+      h('h3', { className: 'mb-1 text-sm font-semibold text-emerald-900' }, '✓ Ready for Approval'),
+      h('p', { className: 'text-sm text-emerald-800' }, 'Conversation is valid and at least one tool call is marked required.')
+    ));
+  }
+
+  // Action buttons
+  const actions = h('div', { className: 'flex items-center gap-2 flex-wrap' });
+
+  actions.appendChild(h('button', {
+    className: 'inline-flex items-center gap-2 rounded-2xl border bg-white px-4 py-2 hover:bg-violet-50 disabled:opacity-50 text-sm',
+    onclick: saveDraft
+  }, state.saving ? 'Saving…' : '💾 Save Draft'));
+
+  actions.appendChild(h('button', {
+    className: `inline-flex items-center gap-2 rounded-2xl border border-violet-300 bg-violet-600 px-4 py-2 text-white shadow hover:bg-violet-700 text-sm ${(!approved || item.deleted) ? 'opacity-50 cursor-not-allowed' : ''}`,
+    onclick: approved && !item.deleted ? approveItem : null
+  }, '✓ Approve'));
+
+  actions.appendChild(h('button', {
+    className: 'inline-flex items-center gap-2 rounded-2xl border border-slate-300 bg-white px-4 py-2 text-slate-700 hover:bg-slate-50 text-sm',
+    onclick: duplicateItem
+  }, '↻ Duplicate'));
+
+  actions.appendChild(h('button', {
+    className: 'inline-flex items-center gap-2 rounded-2xl border bg-white px-4 py-2 hover:bg-violet-50 text-sm',
+    onclick: skipItem
+  }, 'Skip'));
+
+  if (!item.deleted) {
+    actions.appendChild(h('button', {
+      className: 'ml-auto inline-flex items-center gap-2 rounded-2xl border border-rose-300 bg-white px-4 py-2 text-rose-700 hover:bg-rose-50 text-sm',
+      onclick: deleteItem
+    }, '🗑 Delete'));
+  } else {
+    actions.appendChild(h('button', {
+      className: 'ml-auto inline-flex items-center gap-2 rounded-2xl border border-emerald-300 bg-white px-4 py-2 text-emerald-700 hover:bg-emerald-50 text-sm',
+      onclick: restoreItem
+    }, '↻ Restore'));
+  }
+
+  leftPane.appendChild(actions);
+  container.appendChild(leftPane);
+
+  // ── SPLIT PANE: Gutter + Evidence (desktop) ──
+  if (!isNarrow) {
+    // Resizable gutter
+    const gutter = h('div', {
+      id: 'split-gutter',
+      className: 'gutter w-1 flex-none bg-slate-200',
+      title: 'Drag to resize'
+    });
+    container.appendChild(gutter);
+
+    // Right pane: Evidence & Metadata
+    const rightPane = h('div', {
+      id: 'evidence-pane',
+      className: 'overflow-y-auto space-y-3 p-1',
+      style: { flex: '5', minWidth: '300px' }
+    });
+
+    rightPane.appendChild(renderTracePanel(item));
+    rightPane.appendChild(renderMetadataPanel(item));
+    container.appendChild(rightPane);
+  }
+
+  return container;
+}
+
+// ── Explorer & Stats ───────────────────────────────────────────
+
+function renderExplorerView() {
+  const section = h('section', { className: 'flex flex-1 flex-col rounded-2xl border bg-white p-4 shadow-sm min-h-0 overflow-auto' });
+
+  section.appendChild(h('h2', { className: 'text-2xl font-bold text-slate-800 mb-2' }, 'Ground Truths Explorer'));
+  section.appendChild(h('p', { className: 'text-sm text-slate-600 mb-4' }, 'Explore agentic ground truths with filtering, sorting, and bulk actions.'));
+
+  // Filters
+  const filters = h('div', { className: 'flex items-center gap-2 flex-wrap mb-4' });
+  filters.appendChild(h('span', { className: 'text-sm font-medium text-slate-700' }, 'Status:'));
+  const colors = { all: 'bg-violet-600 text-white', draft: 'bg-amber-600 text-white', approved: 'bg-emerald-600 text-white', skipped: 'bg-slate-600 text-white', deleted: 'bg-rose-600 text-white' };
+  for (const s of ['all', 'draft', 'approved', 'skipped', 'deleted']) {
+    filters.appendChild(h('button', {
+      className: `rounded-lg px-3 py-1.5 text-sm font-medium ${colors[s]}`,
+      onclick: () => toast('info', `Filter: ${s}`)
+    }, s.charAt(0).toUpperCase() + s.slice(1)));
+  }
+  section.appendChild(filters);
+
+  // Table
+  const tableWrap = h('div', { className: 'overflow-x-auto' });
+  const table = h('table', { className: 'w-full table-auto' });
+  const thead = h('thead', { className: 'bg-slate-50 border-b border-slate-200' });
+  const tr = h('tr', { className: 'text-xs font-semibold text-slate-700' });
+  for (const col of ['ID', 'Status', 'User Message', 'Category', 'Required Tools', 'User Context', 'Actions']) {
+    tr.appendChild(h('th', { className: 'px-3 py-3 text-left' }, col));
+  }
+  thead.appendChild(tr);
+  table.appendChild(thead);
+
+  const tbody = h('tbody', { className: 'divide-y divide-slate-100' });
+  for (const item of state.items) {
+    const userMsg = (item.history || []).find(t => t.role === 'user');
+    const row = h('tr', { className: 'hover:bg-slate-50 transition-colors' });
+    row.appendChild(h('td', { className: 'px-3 py-3 text-xs font-mono text-slate-600' }, item.id));
+    const statusTd = h('td', { className: 'px-3 py-3' });
+    statusTd.appendChild(statusBadge(item.status, item.deleted));
+    row.appendChild(statusTd);
+    row.appendChild(h('td', { className: 'px-3 py-3 text-sm max-w-[300px] truncate' }, userMsg ? userMsg.msg : '(none)'));
+    const catTag2 = (item.computedTags || []).find(t => t.startsWith('category:'));
+    row.appendChild(h('td', { className: 'px-3 py-3 text-xs' }, catTag2 ? catTag2.replace('category:', '') : '—'));
+    const requiredTools = getOrderedToolCalls(item).filter(tc => getToolCallDecision(item, tc.id) === 'required');
+    row.appendChild(h('td', { className: 'px-3 py-3 text-xs text-center' }, requiredTools.length > 0 ? `${requiredTools.length} required` : '—'));
+    row.appendChild(h('td', { className: 'px-3 py-3 text-xs text-center' }, `${(item.userContextEntries || []).length}`));
+    const actionTd = h('td', { className: 'px-3 py-3' });
+    const actBtns = h('div', { className: 'flex gap-2' });
+    actBtns.appendChild(h('button', {
+      className: 'rounded-lg border border-violet-300 bg-violet-600 px-3 py-1.5 text-xs text-white hover:bg-violet-700',
+      onclick: () => { selectItem(item.id); state.viewMode = 'curate'; render(); }
+    }, 'Curate'));
+    actionTd.appendChild(actBtns);
+    row.appendChild(actionTd);
+    tbody.appendChild(row);
+  }
+  table.appendChild(tbody);
+  tableWrap.appendChild(table);
+  section.appendChild(tableWrap);
+
+  return section;
+}
+
+function renderStatsView() {
+  const section = h('section', { className: 'flex flex-1 flex-col rounded-2xl border bg-white p-4 shadow-sm overflow-auto' });
+
+  section.appendChild(h('div', { className: 'flex items-center justify-between mb-4' },
+    h('h2', { className: 'text-lg font-semibold' }, 'Agentic Ground Truth — Stats'),
+    h('button', { className: 'rounded-xl border px-3 py-1.5 text-sm hover:bg-violet-50', onclick: () => { state.viewMode = 'curate'; render(); } }, '← Back')
+  ));
+
+  const counts = state.items.reduce((acc, it) => {
+    acc[it.status] = (acc[it.status] || 0) + 1;
+    if (it.deleted) acc.deleted = (acc.deleted || 0) + 1;
+    return acc;
+  }, {});
+
+  const totalRequiredSelections = state.items.reduce((sum, it) => sum + (getOrderedToolCalls(it).some(tc => getToolCallDecision(it, tc.id) === 'required') ? 1 : 0), 0);
+  const totalToolCalls = state.items.reduce((sum, it) => sum + (it.trace?.chat_history?.[0]?.context || []).length, 0);
+  const totalUserContext = state.items.reduce((sum, it) => sum + (it.userContextEntries || []).length, 0);
+
+  const grid = h('div', { className: 'grid grid-cols-1 gap-3 sm:grid-cols-4' });
+  const statCards = [
+    { label: 'Total Items', val: state.items.length, color: 'violet' },
+    { label: 'Approved', val: counts.approved || 0, color: 'emerald' },
+    { label: 'Draft', val: counts.draft || 0, color: 'amber' },
+    { label: 'Deleted', val: counts.deleted || 0, color: 'rose' },
+  ];
+  for (const s of statCards) {
+    grid.appendChild(h('div', { className: `rounded-xl border bg-${s.color}-50 p-4` },
+      h('div', { className: 'text-xs uppercase tracking-wide text-slate-600' }, s.label),
+      h('div', { className: `mt-1 text-2xl font-bold text-${s.color}-800` }, String(s.val))
+    ));
+  }
+  section.appendChild(grid);
+
+  // Additional stats
+  const grid2 = h('div', { className: 'grid grid-cols-1 gap-3 sm:grid-cols-3 mt-3' });
+  const extraCards = [
+    { label: 'Required Tools Set', val: totalRequiredSelections, color: 'violet' },
+    { label: 'User Context Entries', val: totalUserContext, color: 'amber' },
+    { label: 'Total Tool Calls', val: totalToolCalls, color: 'slate' },
+  ];
+  for (const s of extraCards) {
+    grid2.appendChild(h('div', { className: `rounded-xl border bg-${s.color}-50 p-4` },
+      h('div', { className: 'text-xs uppercase tracking-wide text-slate-600' }, s.label),
+      h('div', { className: `mt-1 text-2xl font-bold text-${s.color}-800` }, String(s.val))
+    ));
+  }
+  section.appendChild(grid2);
+
+  return section;
+}
+
+// ═══════════════════════════════════════════════════════════════
+// MODALS
+// ═══════════════════════════════════════════════════════════════
+
+function renderTagsModal() {
+  if (!state.tagsModalOpen) return null;
+  const item = getCurrent();
+  if (!item) return null;
+
+  const overlay = h('div', { className: 'fixed inset-0 z-50 flex items-center justify-center bg-black/50' });
+  overlay.onclick = (e) => { if (e.target === overlay) { state.tagsModalOpen = false; render(); } };
+
+  const modal = h('div', { className: 'relative z-10 max-h-[90vh] w-full max-w-3xl overflow-hidden rounded-2xl bg-white shadow-2xl' });
+  modal.onclick = (e) => e.stopPropagation();
+
+  // Header
+  const hdr = h('div', { className: 'flex items-center justify-between border-b border-slate-200 bg-violet-50 p-4' });
+  hdr.appendChild(h('div', {},
+    h('h2', { className: 'text-lg font-semibold text-slate-900' }, '🏷 Manage Tags'),
+    h('p', { className: 'text-sm text-slate-600' }, 'Ground Truth Level Tags')
+  ));
+  hdr.appendChild(h('button', {
+    className: 'rounded-lg p-2 text-slate-500 hover:bg-slate-100',
+    onclick: () => { state.tagsModalOpen = false; render(); }
+  }, '✕'));
+  modal.appendChild(hdr);
+
+  // Content
+  const content = h('div', { className: 'max-h-[calc(90vh-10rem)] overflow-auto p-4' });
+
+  // Computed tags
+  if (item.computedTags && item.computedTags.length > 0) {
+    const compDiv = h('div', { className: 'mb-4 rounded-xl border border-slate-200 bg-slate-50 p-4' });
+    compDiv.appendChild(h('div', { className: 'mb-2 text-sm font-medium text-slate-600' }, '🔒 Auto-generated Tags'));
+    const compWrap = h('div', { className: 'flex flex-wrap gap-2' });
+    for (const tag of item.computedTags) {
+      compWrap.appendChild(h('span', { className: 'inline-flex items-center gap-1 rounded-full bg-white border border-slate-200 px-3 py-1 text-sm text-slate-600' }, '🔒 ' + tag));
+    }
+    compDiv.appendChild(compWrap);
+    compDiv.appendChild(h('p', { className: 'mt-2 text-xs text-slate-500' }, 'These tags are automatically generated and cannot be edited.'));
+    content.appendChild(compDiv);
+  }
+
+  // Manual tags selection
+  const groups = {};
+  for (const tag of AVAILABLE_TAGS) {
+    const sep = tag.indexOf(':');
+    const prefix = sep === -1 ? 'misc' : tag.slice(0, sep);
+    (groups[prefix] = groups[prefix] || []).push({ name: tag.slice(sep + 1), full: tag });
+  }
+
+  const tagsGrid = h('div', { className: 'grid gap-4 sm:grid-cols-2' });
+  for (const [group, names] of Object.entries(groups)) {
+    const groupDiv = h('div', { className: 'rounded-xl border border-slate-200 bg-white' });
+    groupDiv.appendChild(h('div', { className: 'px-3 py-2 text-sm font-medium border-b border-slate-200 flex items-center gap-2' },
+      '🏷 ' + group
+    ));
+    const tagsWrap = h('div', { className: 'p-3 flex flex-wrap gap-2' });
+    // `full` is the original tag string, so unprefixed or multi-colon tags still match item.manualTags.
+    for (const { name, full } of names) {
+      const isSelected = (item.manualTags || []).includes(full);
+      tagsWrap.appendChild(h('button', {
+        className: `rounded-full px-3 py-1 text-xs font-medium transition-colors border ${isSelected ? 'bg-violet-600 text-white border-violet-600 hover:bg-violet-700' : 'bg-amber-100 text-amber-800 border-amber-200 hover:bg-amber-200'}`,
+        onclick: () => toggleTag(full)
+      }, name));
+    }
+    groupDiv.appendChild(tagsWrap);
+    tagsGrid.appendChild(groupDiv);
+  }
+  content.appendChild(tagsGrid);
+  modal.appendChild(content);
+
+  // Footer
+  const footer = h('div', { className: 'border-t border-slate-200 bg-violet-50 p-4 flex items-center justify-between' });
+  footer.appendChild(h('p', { className: 'text-xs text-slate-700' }, 'Tags help organize and filter ground truths by domain, intent, or quality.'));
+  footer.appendChild(h('button', {
+    className: 'rounded-xl border border-slate-300 bg-white px-4 py-2 text-sm font-medium text-slate-700 hover:bg-slate-50',
+    onclick: () => { state.tagsModalOpen = false; render(); }
+  }, 'Done'));
+  modal.appendChild(footer);
+
+  overlay.appendChild(modal);
+  return overlay;
+}
+
+function renderEditorModal() {
+  if (!state.editorOpen) return null;
+  const item = getCurrent();
+  if (!item) return null;
+
+  const overlay = h('div', { className: 'fixed inset-0 z-50 flex items-center justify-center bg-slate-950/50 p-4' });
+  overlay.onclick = (e) => { if (e.target === overlay) closeEditor(); };
+
+  const modal = h('div', { className: 'relative z-10 w-full max-w-5xl overflow-hidden rounded-3xl border border-slate-200 bg-white shadow-2xl' });
+  modal.onclick = (e) => e.stopPropagation();
+
+  const header = h('div', { className: 'border-b border-slate-200 bg-violet-50 px-6 py-4 flex items-center justify-between' });
+  header.appendChild(h('div', {},
+    h('h2', { className: 'text-lg font-semibold text-slate-900' }, 'Conversation Editor'),
+    h('p', { className: 'text-sm text-slate-600' }, 'Describe what to change, inspect the proposed edits, then apply, deny, or manually edit.')
+  ));
+  header.appendChild(h('button', {
+    className: 'rounded-lg px-3 py-2 text-sm font-medium text-slate-500 hover:bg-white',
+    onclick: closeEditor,
+  }, 'Close'));
+  modal.appendChild(header);
+
+  const body = h('div', { className: 'grid gap-0 lg:grid-cols-[1.1fr,0.9fr]' });
+
+  const chatPane = h('div', { className: 'border-r border-slate-200 p-6 space-y-4' });
+  const promptCard = h('div', { className: 'rounded-2xl bg-slate-900 p-4 text-sm text-white' });
+  promptCard.appendChild(h('div', { className: 'mb-2 text-xs font-semibold uppercase tracking-wide text-violet-200' }, 'Prompt'));
+  const promptArea = h('textarea', {
+    className: 'h-28 w-full rounded-xl border border-slate-700 bg-slate-950/60 p-3 text-sm text-white focus:outline-none focus:ring-2 focus:ring-violet-400',
+    placeholder: 'Example: Make the orchestrator response shorter, add user context about roaming, and mark the billing tool as required.',
+    oninput: (e) => { state.editorPrompt = e.target.value; },
+  });
+  promptArea.value = state.editorPrompt;
+  promptCard.appendChild(promptArea);
+  chatPane.appendChild(promptCard);
+  chatPane.appendChild(h('button', {
+    className: 'rounded-xl bg-violet-600 px-4 py-2 text-sm font-medium text-white hover:bg-violet-700',
+    onclick: generateEditorSuggestions,
+  }, 'Generate Edits'));
+
+  const response = h('div', { className: 'rounded-2xl border border-slate-200 bg-slate-50 p-4' });
+  response.appendChild(h('div', { className: 'mb-2 text-xs font-semibold uppercase tracking-wide text-slate-500' }, 'Proposed edits'));
+  if (!state.editorSuggestions.length) {
+    response.appendChild(h('p', { className: 'text-sm text-slate-500' }, 'No edits generated yet. The chatbot response will appear here.'));
+  } else {
+    const suggestionList = h('div', { className: 'space-y-3' });
+    for (let i = 0; i < state.editorSuggestions.length; i++) {
+      const suggestion = state.editorSuggestions[i];
+      const card = h('div', { className: 'rounded-2xl border border-slate-200 bg-white p-4' });
+      card.appendChild(h('div', { className: 'mb-2 text-sm font-medium text-slate-800' }, suggestion.label));
+
+      if (suggestion.type === 'turnEdit') {
+        const turnSelect = h('select', {
+          className: 'mb-2 w-full rounded-lg border border-slate-300 px-3 py-2 text-sm focus:outline-none focus:ring-2 focus:ring-violet-300',
+          onchange: (e) => updateEditorSuggestion(i, 'turnIndex', Number(e.target.value)),
+        });
+        for (let turnIndex = 0; turnIndex < (item.history || []).length; turnIndex++) {
+          const option = h('option', { value: String(turnIndex) }, `${turnIndex + 1}. ${item.history[turnIndex].role}`);
+          if (turnIndex === suggestion.turnIndex) option.selected = true;
+          turnSelect.appendChild(option);
+        }
+        card.appendChild(turnSelect);
+
+        const textArea = h('textarea', {
+          className: 'h-32 w-full rounded-lg border border-slate-300 p-3 text-sm focus:outline-none focus:ring-2 focus:ring-violet-300',
+          oninput: (e) => updateEditorSuggestion(i, 'value', e.target.value),
+        });
+        textArea.value = suggestion.value;
+        card.appendChild(textArea);
+      }
+
+      if (suggestion.type === 'userContext') {
+        const grid = h('div', { className: 'grid gap-2 md:grid-cols-2' });
+        const keyInput = h('input', {
+          className: 'rounded-lg border border-slate-300 px-3 py-2 text-sm focus:outline-none focus:ring-2 focus:ring-violet-300',
+          value: suggestion.key,
+          oninput: (e) => updateEditorSuggestion(i, 'key', e.target.value),
+        });
+        const valueInput = h('input', {
+          className: 'rounded-lg border border-slate-300 px-3 py-2 text-sm focus:outline-none focus:ring-2 focus:ring-violet-300',
+          value: suggestion.value,
+          oninput: (e) => updateEditorSuggestion(i, 'value', e.target.value),
+        });
+        grid.appendChild(keyInput);
+        grid.appendChild(valueInput);
+        card.appendChild(grid);
+      }
+
+      if (suggestion.type === 'toolDecision') {
+        card.appendChild(h('p', { className: 'mb-2 text-sm text-slate-600' }, suggestion.functionName));
+        const select = h('select', {
+          className: 'w-full rounded-lg border border-slate-300 px-3 py-2 text-sm focus:outline-none focus:ring-2 focus:ring-violet-300',
+          onchange: (e) => updateEditorSuggestion(i, 'value', e.target.value),
+        });
+        const requiredOption = h('option', { value: 'required' }, '★ Required');
+        const optionalOption = h('option', { value: 'optional' }, '○ Optional');
+        const notNeededOption = h('option', { value: 'not-needed' }, '✕ Not needed');
+        if (suggestion.value === 'required') requiredOption.selected = true;
+        if (suggestion.value === 'optional') optionalOption.selected = true;
+        if (suggestion.value === 'not-needed') notNeededOption.selected = true;
+        select.appendChild(requiredOption);
+        select.appendChild(optionalOption);
+        select.appendChild(notNeededOption);
+        card.appendChild(select);
+      }
+
+      suggestionList.appendChild(card);
+    }
+    response.appendChild(suggestionList);
+  }
+  chatPane.appendChild(response);
+  body.appendChild(chatPane);
+
+  const manualPane = h('div', { className: 'p-6 space-y-4 bg-white' });
+  manualPane.appendChild(h('div', {},
+    h('div', { className: 'text-sm font-medium text-slate-800' }, 'Manual Edit'),
+    h('p', { className: 'text-sm text-slate-500 mt-1' }, 'Use the assistant suggestions as a starting point or switch to direct manual editing.')
+  ));
+  manualPane.appendChild(h('button', {
+    className: 'rounded-xl border border-slate-300 px-4 py-2 text-sm font-medium text-slate-700 hover:bg-slate-50',
+    onclick: () => { state.editorManualMode = !state.editorManualMode; render(); },
+  }, state.editorManualMode ? 'Hide Manual Edit' : 'Show Manual Edit'));
+
+  if (state.editorManualMode) {
+    const targetSelect = h('select', {
+      className: 'w-full rounded-lg border border-slate-300 px-3 py-2 text-sm focus:outline-none focus:ring-2 focus:ring-violet-300',
+      onchange: (e) => syncEditorManualTarget(e.target.value),
+    });
+    for (const role of ['user', 'orchestrator-agent', 'output-agent']) {
+      const option = h('option', { value: role }, role);
+      if (role === state.editorManualTarget) option.selected = true;
+      targetSelect.appendChild(option);
+    }
+    manualPane.appendChild(targetSelect);
+
+    const manualText = h('textarea', {
+      className: 'h-64 w-full rounded-2xl border border-slate-300 p-4 text-sm focus:outline-none focus:ring-2 focus:ring-violet-300',
+      oninput: (e) => { state.editorManualText = e.target.value; },
+    });
+    manualText.value = state.editorManualText;
+    manualPane.appendChild(manualText);
+
+    manualPane.appendChild(h('button', {
+      className: 'rounded-xl border border-emerald-300 bg-emerald-50 px-4 py-2 text-sm font-medium text-emerald-700 hover:bg-emerald-100',
+      onclick: applyManualEditorChange,
+    }, 'Apply Manual Edit'));
+  }
+
+  const footer = h('div', { className: 'pt-4 flex flex-wrap items-center gap-2 border-t border-slate-200' });
+  footer.appendChild(h('button', {
+    className: 'rounded-xl bg-violet-600 px-4 py-2 text-sm font-medium text-white hover:bg-violet-700 disabled:opacity-50',
+    onclick: state.editorSuggestions.length ? applyEditorSuggestions : null,
+  }, 'Apply'));
+  footer.appendChild(h('button', {
+    className: 'rounded-xl border border-slate-300 bg-white px-4 py-2 text-sm font-medium text-slate-700 hover:bg-slate-50',
+    onclick: denyEditorSuggestions,
+  }, 'Deny'));
+  manualPane.appendChild(footer);
+  body.appendChild(manualPane);
+
+  modal.appendChild(body);
+  overlay.appendChild(modal);
+  return overlay;
+}
+
+// ═══════════════════════════════════════════════════════════════
+// TOASTS
+// ═══════════════════════════════════════════════════════════════
+
+function renderToasts() {
+  if (state.toasts.length === 0) return null;
+  const container = h('div', { className: 'fixed bottom-4 right-4 z-50 space-y-2' });
+  const colors = { success: 'border-emerald-300', error: 'border-rose-300', info: 'border-violet-300' };
+  for (const t of state.toasts) {
+    container.appendChild(h('div', {
+      className: `flex min-w-[260px] items-center justify-between gap-3 rounded-xl border bg-white px-3 py-2 shadow-lg ${colors[t.kind] || ''}`
+    }, h('div', { className: 'text-sm' }, t.msg)));
+  }
+  return container;
+}
+
+// ═══════════════════════════════════════════════════════════════
+// MAIN RENDER
+// ═══════════════════════════════════════════════════════════════
+
+function render() {
+  const app = document.getElementById('app');
+  app.innerHTML = '';
+
+  const wrapper = h('div', { className: 'flex h-screen w-screen flex-col overflow-hidden' });
+
+  // Accent bar
+  wrapper.appendChild(h('div', { className: 'h-1 w-full flex-none bg-gradient-to-r from-violet-500 via-fuchsia-500 to-pink-500' }));
+
+  // Header
+  wrapper.appendChild(renderHeader());
+
+  // Main content
+  const main = h('main', { className: 'mx-auto flex w-full max-w-none flex-1 flex-col gap-4 p-4 min-h-0' });
+
+  if (state.viewMode === 'stats') {
+    main.appendChild(renderStatsView());
+  } else if (state.viewMode === 'explorer') {
+    main.appendChild(renderExplorerView());
+  } else {
+    // Curate view
+    const grid = h('div', { className: 'grid grid-cols-12 gap-4 flex-1 min-h-0' });
+
+    if (state.sidebarOpen) {
+      grid.appendChild(renderQueueSidebar());
+    }
+
+    const curateCol = h('div', {
+      className: state.sidebarOpen ? 'col-span-9' : 'col-span-12'
+    });
+    curateCol.appendChild(renderCuratePane(getCurrent()));
+    grid.appendChild(curateCol);
+
+    main.appendChild(grid);
+  }
+
+  wrapper.appendChild(main);
+  app.appendChild(wrapper);
+
+  // Evidence drawer (mobile)
+  if (state.viewMode === 'curate' && window.innerWidth < 1024) {
+    const currentItem = getCurrent();
+    if (currentItem) {
+      // Backdrop
+      if (state.evidenceDrawerOpen) {
+        const backdrop = h('div', {
+          className: 'fixed inset-0 bg-black/20 z-40',
+          onclick: toggleEvidenceDrawer
+        });
+        app.appendChild(backdrop);
+      }
+      // Drawer
+      const drawer = h('div', {
+        className: `evidence-drawer fixed right-0 top-0 bottom-0 bg-white shadow-2xl z-50 flex flex-col ${state.evidenceDrawerOpen ? '' : 'closed'}`,
+        style: { width: '90vw', maxWidth: '480px' }
+      });
+      const drawerHeader = h('div', { className: 'px-4 py-3 border-b flex items-center justify-between flex-none' });
+      drawerHeader.appendChild(h('span', { className: 'text-sm font-semibold text-slate-600 uppercase tracking-wider' }, 'Evidence Panel'));
+      drawerHeader.appendChild(h('button', {
+        className: 'text-slate-400 hover:text-slate-600 p-1',
+        onclick: toggleEvidenceDrawer
+      }, '✕'));
+      drawer.appendChild(drawerHeader);
+
+      const drawerContent = h('div', { className: 'flex-1 overflow-y-auto p-3 space-y-3' });
+      drawerContent.appendChild(renderTracePanel(currentItem));
+      drawerContent.appendChild(renderMetadataPanel(currentItem));
+      drawer.appendChild(drawerContent);
+      app.appendChild(drawer);
+    }
+  }
+
+  // Modals
+  const tagsModal = renderTagsModal();
+  if (tagsModal) app.appendChild(tagsModal);
+
+  const editorModal = renderEditorModal();
+  if (editorModal) app.appendChild(editorModal);
+
+  // Toasts
+  const toasts = renderToasts();
+  if (toasts) app.appendChild(toasts);
+
+  // Bind resizable gutter events
+  bindGutterEvents();
+}
+
+// ═══════════════════════════════════════════════════════════════
+// RESIZABLE GUTTER
+// ═══════════════════════════════════════════════════════════════
+
+function bindGutterEvents() {
+  const gutter = document.getElementById('split-gutter');
+  const convPane = document.getElementById('conversation-pane');
+  const evidPane = document.getElementById('evidence-pane');
+
+  if (!gutter || !convPane || !evidPane) return;
+
+  // Drag listeners live on `document` only while a drag is in progress.
+  // render() calls this function on every pass, so registering them here
+  // unconditionally would accumulate a new pair of handlers per render.
+  gutter.addEventListener('mousedown', (e) => {
+    const startX = e.clientX;
+    const convRect = convPane.getBoundingClientRect();
+    const evidRect = evidPane.getBoundingClientRect();
+    const total = convRect.width + evidRect.width;
+    const startConvFlex = convRect.width / total * 10;
+    const startEvidFlex = evidRect.width / total * 10;
+    gutter.classList.add('dragging');
+    document.body.style.cursor = 'col-resize';
+    document.body.style.userSelect = 'none';
+
+    const onMove = (ev) => {
+      const dx = ev.clientX - startX;
+      const containerWidth = convPane.parentElement.getBoundingClientRect().width - 4;
+      const dFlex = (dx / containerWidth) * 10;
+      const newConv = Math.max(2, Math.min(8, startConvFlex + dFlex));
+      const newEvid = Math.max(2, Math.min(8, startEvidFlex - dFlex));
+      convPane.style.flex = newConv;
+      evidPane.style.flex = newEvid;
+    };
+
+    const onUp = () => {
+      document.removeEventListener('mousemove', onMove);
+      document.removeEventListener('mouseup', onUp);
+      gutter.classList.remove('dragging');
+      document.body.style.cursor = '';
+      document.body.style.userSelect = '';
+    };
+
+    document.addEventListener('mousemove', onMove);
+    document.addEventListener('mouseup', onUp);
+  });
+}
+
+// Re-render on resize for responsive split pane ↔ drawer
+window.addEventListener('resize', () => render());
+
+// Initial render
+render();
+</script>
+</body>
+</html>
diff --git a/wireframes/gt_schema_v5_generic.py b/wireframes/gt_schema_v5_generic.py
new file mode 100644
index 0000000..c3ec7a1
--- /dev/null
+++ b/wireframes/gt_schema_v5_generic.py
@@ -0,0 +1,170 @@
+from __future__ import annotations
+
+from typing import Any, Literal
+
+from pydantic import BaseModel, Field, field_validator, model_validator
+
+
+class HistoryEntry(BaseModel):
+    role: str
+    msg: str
+
+    model_config = {"extra": "forbid"}
+
+
+class ContextEntry(BaseModel):
+    key: str
+    value: Any
+
+    model_config = {"extra": "forbid"}
+
+
+class FeedbackEntry(BaseModel):
+    source: str = ""
+    values: dict[str, Any] = Field(default_factory=dict)
+
+    model_config = {"extra": "forbid"}
+
+
+class ToolCallRecord(BaseModel):
+    id: str = ""
+    name: str
+    call_type: Literal["tool", "subagent"] = Field("tool", alias="callType")
+    agent: str | None = None
+    # Step of the agent execution in which this tool was called. A step spans the interval between agent responses to the user; multiple tools can be called in the same step.
+    sequence_number: int | None = Field(None, alias="stepNumber")
+    parallel_group: str | None = Field(None, alias="parallelGroup")
+    parent_call_id: str | None = Field(None, alias="parentCallId")
+    response: Any = None
+
+    model_config = {"extra": "forbid", "populate_by_name": True}
+
+
+class PluginPayload(BaseModel):
+    kind: str
+    version: str = "1.0"
+    data: dict[str, Any] = Field(default_factory=dict)
+
+    model_config = {"extra": "forbid"}
+
+
+class ToolExpectation(BaseModel):
+    name: str
+    arguments: dict[str, Any] | str | None = None
+
+    model_config = {"extra": "forbid"}
+
+
+class ExpectedTools(BaseModel):
+    """Tool expectations. Every tool defaults to allowed unless listed here."""
+
+    required: list[ToolExpectation] = Field(default_factory=list)
+    optional: list[ToolExpectation] = Field(default_factory=list)
+    not_needed: list[ToolExpectation] = Field(default_factory=list)
+
+    model_config = {"extra": "forbid"}
+
+    @field_validator("required", "optional", "not_needed", mode="before")
+    @classmethod
+    def _coerce_string_entries(cls, value: object) -> object:
+        if not isinstance(value, list):
+            return value
+        normalized: list[object] = []
+        for item in value:
+            if isinstance(item, str):
+                normalized.append({"name": item})
+            else:
+                normalized.append(item)
+        return normalized
+
+    @model_validator(mode="after")
+    def _reject_overlap(self) -> ExpectedTools:
+        required_names = {tool.name for tool in self.required}
+        optional_names = {tool.name for tool in self.optional}
+        not_needed_names = {tool.name for tool in self.not_needed}
+        overlap = sorted(
+            (required_names & optional_names)
+            | (required_names & not_needed_names)
+            | (optional_names & not_needed_names)
+        )
+        if overlap:
+            raise ValueError(
+                f"tools cannot appear in more than one category: {', '.join(overlap)}"
+            )
+        return self
+
+# ---------------------------------------------------------------------------
+# Generic GT schema
+# ---------------------------------------------------------------------------
+
+
+class AgenticGroundTruthEntry(BaseModel):
+    # --- Core identity / storage ---
+    id: str = ""
+    dataset_name: str = Field("", alias="datasetName")
+    bucket: str = ""
+    doc_type: str = Field("ground-truth", alias="docType")
+    schema_version: str = Field("agentic-core/v1", alias="schemaVersion")
+    status: str = "draft"
+    etag: str = Field("", alias="_etag")
+    assigned_to: str = Field("", alias="assignedTo")
+    assigned_at: str = Field("", alias="assignedAt")
+    updated_at: str = Field("", alias="updatedAt")
+    updated_by: str = Field("", alias="updatedBy")
+    reviewed_at: str | None = Field(None, alias="reviewedAt")
+    manual_tags: list[str] = Field(default_factory=list, alias="manualTags")
+    computed_tags: list[str] = Field(default_factory=list, alias="computedTags")
+
+    # --- Scenario content ---
+    scenario_id: str = Field("", alias="scenarioId")
+    history: list[HistoryEntry] = Field(default_factory=list)
+    context_entries: list[ContextEntry] = Field(default_factory=list, alias="contextEntries")
+
+    # --- Agentic execution data ---
+    trace_ids: dict[str, str] | None = Field(None, alias="traceIds")
+    tool_calls: list[ToolCallRecord] = Field(default_factory=list, alias="toolCalls")
+    expected_tools: ExpectedTools = Field(default_factory=ExpectedTools, alias="expectedTools")
+
+    # --- Flexible extension surfaces ---
+    feedback: list[FeedbackEntry] = Field(default_factory=list)
+    metadata: dict[str, Any] = Field(default_factory=dict)
+    plugins: dict[str, PluginPayload] = Field(default_factory=dict)
+    comment: str = ""
+
+    # --- Provenance ---
+    created_by: str | None = Field(None, alias="createdBy")
+    created_at: str | None = Field(None, alias="createdAt")
+
+    # --- Stored at the bottom for better readability ---
+    trace_payload: dict[str, Any] = Field(default_factory=dict, alias="tracePayload")
+
+    model_config = {"extra": "forbid", "populate_by_name": True}
+
+
+    def set_plugin(self, slot: str, data: dict[str, Any], *, version: str = "1.0") -> None:
+        """Attach opaque customer- or feature-specific data under a named plugin slot.
+        """
+
+        self.plugins[slot] = PluginPayload(kind=slot, version=version, data=data)
+
+    def get_plugin_data(self, slot: str) -> dict[str, Any] | None:
+        plugin = self.plugins.get(slot)
+        return None if plugin is None else plugin.data
+
+    def export_json_schema(self) -> dict[str, Any]:
+        return self.model_json_schema()
+
+
+
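+# Public API. Every name listed here must be defined in this module; an undefined
+# name makes `from gt_schema_v5_generic import *` raise AttributeError.
+__all__ = [
+    "AgenticGroundTruthEntry",
+    "ContextEntry",
+    "ExpectedTools",
+    "FeedbackEntry",
+    "HistoryEntry",
+    "PluginPayload",
+    "ToolCallRecord",
+    "ToolExpectation",
+]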
\ No newline at end of file