feat(workflows): add beval behavioral evaluation workflow for dt-coach agent #1129

Open
eedorenko wants to merge 75 commits into microsoft:main from eedorenko:eedorenko/beval

Conversation

@eedorenko
Contributor

Description

Adds a behavioral evaluation (beval) CI workflow for the dt-coach agent using GitHub Copilot CLI over ACP (TCP). The workflow:

  • Starts two Copilot CLI instances (agent on port 3000, judge on port 3001)
  • Runs beval evaluations against cases defined in beval/cases/
  • Uploads results as a workflow artifact
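
The first step only helps if both instances are accepting connections before evaluations begin. A minimal sketch of such a readiness wait follows. Ports 3000 and 3001 come from the list above; the helper name `wait_for_port`, the host, and the timeout values are assumptions for illustration and are not taken from the workflow itself.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout.

    Hypothetical helper: the real workflow may poll differently.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: wait briefly and retry.
            time.sleep(interval)
    return False


def wait_for_instances(ports: dict[str, int], host: str = "127.0.0.1",
                       timeout: float = 60.0) -> dict[str, bool]:
    """Check each named instance in turn and report which ones came up."""
    return {name: wait_for_port(host, port, timeout) for name, port in ports.items()}
```

Calling `wait_for_instances({"agent": 3000, "judge": 3001})` before invoking beval would let the job fail fast when either instance did not start, rather than surfacing as evaluation errors later.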

Also pins all GitHub Actions dependencies to SHA hashes for supply chain security, and installs beval from the default branch of the vyta/beval repo.

Sample eval run: https://github.com/eedorenko/hve-core/actions/runs/23311489579/job/67799722616

======================================================================
  SCORECARD
======================================================================
  Overall: 0.81  (30/30 cases passed)

  Metric                Score  Bar
  -------------------- ------  ------------
  latency               0.86  [#########-]
  quality               0.78  [########--]

  Case                                      Score  Status
  ---------------------------------------- ------  ------
  Response follows Think/Speak/Empower ...  0.66  + PASS
  Keep responses concise — no methodolo...  0.86  + PASS
  End with choices not directives           0.91  + PASS
  Work WITH users, not FOR them             0.95  + PASS
  Do not prescribe specific solutions t...  0.93  + PASS
  Stay curious and supportive when user...  0.81  + PASS
  Method 1: Assess whether request is f...  0.93  + PASS
  Method 1: Guide stakeholder identific...  0.82  + PASS
  Method 2: Help plan systematic research   0.68  + PASS
  Method 3: Guide pattern recognition f...  0.81  + PASS
  Method 4: Facilitate divergent ideation   0.65  + PASS
  Method 5: Guide concept creation for ...  0.65  + PASS
  Method 6: Encourage scrappy constrain...  0.86  + PASS
  Method 7: Guide technical feasibility...  0.65  + PASS
  Method 8: Structure user testing for ...  0.65  + PASS
  Method 9: Guide continuous optimizati...  0.63  + PASS
  Start with broad hints when user is s...  0.73  + PASS
  Escalate hints when user remains stuck    0.90  + PASS
  Accept backward transitions between m...  0.83  + PASS
  Announce method shifts transparently      0.84  + PASS
  Avoid multiple-choice question lists      0.81  + PASS
  Do not change method focus without an...  0.95  + PASS
  Resume session with state context         0.63  + PASS
  Ask for project slug during initializ...  0.94  + PASS
  Gather role, team, and method focus d...  0.92  + PASS
  Default to Method 1 for new projects      0.88  + PASS
  Ask targeted, open-ended questions du...  0.88  + PASS
  Summarize progress and check direction    0.78  + PASS
  Recap accomplishments and confirm met...  0.83  + PASS
  Summarize session and suggest next st...  0.89  + PASS

  Avg time: 38.5s
======================================================================
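
As a sanity check on the scorecard, the sketch below recomputes the overall figure from the per-case scores shown above. It assumes the overall score is the unweighted mean of the case scores and that each bar fills one cell per rounded tenth; both are inferences from the displayed numbers, not documented beval behavior. The names `CASE_SCORES` and `render_bar` are illustrative.

```python
from statistics import mean

# Per-case scores copied from the scorecard above, in display order.
CASE_SCORES = [
    0.66, 0.86, 0.91, 0.95, 0.93, 0.81, 0.93, 0.82, 0.68, 0.81,
    0.65, 0.65, 0.86, 0.65, 0.65, 0.63, 0.73, 0.90, 0.83, 0.84,
    0.81, 0.95, 0.63, 0.94, 0.92, 0.88, 0.88, 0.78, 0.83, 0.89,
]


def render_bar(score: float, width: int = 10) -> str:
    """Render a score in [0, 1] as a fixed-width bar (assumed rounding rule)."""
    filled = int(score * width + 0.5)  # round half up to the nearest cell
    return "[" + "#" * filled + "-" * (width - filled) + "]"


# Unweighted mean of the 30 displayed case scores (assumed aggregation).
overall = mean(CASE_SCORES)

print(f"Overall: {overall:.2f}  ({len(CASE_SCORES)} cases)")
print(f"latency  0.86  {render_bar(0.86)}")
print(f"quality  0.78  {render_bar(0.78)}")
```

Under these assumptions the recomputed mean agrees with the reported 0.81 to two decimals, and the rendered bars match the two metric rows.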

Related Issue(s)

Type of Change

Code & Documentation:

  • Bug fix (non-breaking change fixing an issue)
  • New feature (non-breaking change adding functionality)
  • Breaking change (fix or feature causing existing functionality to change)
  • Documentation update

Infrastructure & Configuration:

  • GitHub Actions workflow
  • Linting configuration (markdown, PowerShell, etc.)
  • Security configuration
  • DevContainer configuration
  • Dependency update

AI Artifacts:

  • Reviewed contribution with prompt-builder agent and addressed all feedback
  • Copilot instructions (.github/instructions/*.instructions.md)
  • Copilot prompt (.github/prompts/*.prompt.md)
  • Copilot agent (.github/agents/*.agent.md)
  • Copilot skill (.github/skills/*/SKILL.md)

Other:

  • Script/automation (.ps1, .sh, .py)
  • Other (please describe):

Testing

The workflow has been validated by triggering it manually via workflow_dispatch. All 30 evaluation cases passed with an overall score of 0.81.

Checklist

Required Checks

  • Documentation is updated (if applicable)
  • Files follow existing naming conventions
  • Changes are backwards compatible (if applicable)
  • Tests added for new functionality (if applicable)

AI Artifact Contributions

  • Used /prompt-analyze to review contribution
  • Addressed all feedback from prompt-builder review
  • Verified contribution follows common standards and type-specific requirements

Required Automated Checks

  • Markdown linting: npm run lint:md
  • Spell checking: npm run spell-check
  • Frontmatter validation: npm run lint:frontmatter
  • Skill structure validation: npm run validate:skills
  • Link validation: npm run lint:md-links
  • PowerShell analysis: npm run lint:ps
  • Plugin freshness: npm run plugin:generate

Security Considerations

  • This PR does not contain any sensitive or NDA information
  • Any new dependencies have been reviewed for security issues
  • Security-related scripts follow the principle of least privilege

Additional Notes

All GitHub Actions uses: steps are pinned to SHA hashes per supply chain security best practices.
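
One way to spot-check the pinning claim locally is sketched below. It is a plain-text heuristic written for illustration, not the repository's own pinning check; the function name `unpinned_actions` is hypothetical, and quoted `uses:` values or `docker://` references are not handled.

```python
import re

# Matches "uses: owner/repo@ref" on its own line or as a list item.
USES_RE = re.compile(r"^[ \t]*(?:-[ \t]*)?uses:[ \t]*([^\s#]+)", re.MULTILINE)
# A full commit SHA is 40 lowercase hex characters after the "@".
SHA_RE = re.compile(r"@[0-9a-f]{40}$")


def unpinned_actions(workflow_text: str) -> list[str]:
    """Return every `uses:` reference that is not pinned to a 40-character SHA.

    Local references (starting with "./") carry no ref and are skipped.
    """
    refs = USES_RE.findall(workflow_text)
    return [ref for ref in refs if not ref.startswith("./") and not SHA_RE.search(ref)]
```

Running `unpinned_actions(open(path).read())` over each file under `.github/workflows/` should return an empty list if every step is pinned as described.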

eedorenko and others added 18 commits March 16, 2026 13:24
Add 30 test cases across 4 categories (coaching behaviors, session phases,
method guidance, progressive hints) with ACP judge integration. Include
reusable CI workflow and PR validation hook with fork guard.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Removed port specification from agent startup command.
Add prompt to copilot agent startup command.
Added working-directory to Start agent step in beval.yml
Switch to init_prompt to reliably activate the dt-coach agent in ACP
sessions. Remove --agent flag from copilot TCP start, add port-readiness
polling. Add agent identity verification case. Copy dt-coach.agent.md
to .github/agents/ for flat discovery.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pin actions/checkout, actions/setup-python, and actions/upload-artifact
to SHA hashes to satisfy hve-core dependency pinning policy.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fixes "Directory path must be absolute: ." error from copilot agent.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add model to agent.yaml and eval.config.yaml connection config so it
is applied via set_session_model. Remove --model from workflow CLI args.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Remove branch pin from beval pip install so it uses the default
branch of the vyta/beval repo instead of eedorenko/skill-agent.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@eedorenko eedorenko requested a review from a team as a code owner March 19, 2026 19:36
Comment thread .github/workflows/test-token.yml Fixed
Comment thread .github/workflows/test-token.yml Fixed
Comment thread .github/workflows/test-token.yml Fixed
- Add beval, wireframes, parseable to cspell dictionary
- Ignore beval/results/** from spell check (generated output)
- Add top-level and job-level permissions blocks to test-token.yml

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@eedorenko eedorenko marked this pull request as draft March 19, 2026 20:43
eedorenko and others added 2 commits March 19, 2026 13:45
- Add behavioral evaluation job to release-stable.yml
- Remove test-token.yml debug workflow
- Remove dt-coach.agent.md (not part of this contribution)
- Remove beval/results/ (generated output, not for source control)

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Run npm audit fix to update flatted to a non-vulnerable version.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@eedorenko eedorenko marked this pull request as ready for review March 19, 2026 21:00
@eedorenko
Contributor Author

FYI @vyta @bjcmit

Member

@WilliamBerryiii WilliamBerryiii left a comment

Thank you for this PR, @eedorenko. Behavioral evaluation for the dt-coach agent is a valuable addition, and we appreciate the effort to formalize agent quality testing with structured evaluation cases.

After reviewing the workflow changes against our CI security standards, we've identified several issues that need to be resolved before this can merge. The findings fall into two categories: supply-chain security violations in the beval workflow, and architectural concerns with integrating it into PR validation and release pipelines.

Important

The combination of unpinned dependencies from an external personal repository, unpinned npm range installs, inherited secrets, and persisted credentials creates a compound risk. A compromise of any one dependency effectively grants access to all repository secrets and the CI execution context.

We've added inline comments on each affected file with specific context and suggested changes. The critical items are:

  • pip install from vyta/beval with no commit SHA and no hash verification (see comment on beval.yml line 32)
  • npm install -g @github/copilot@1 with a major-version range and no lockfile (see comment on beval.yml line 29)
  • actions/checkout without persist-credentials: false (see comment on beval.yml line 21)
  • Both copilot instances launch with --allow-all, granting unrestricted permissions (see comment on beval.yml line 36)
  • secrets: inherit in both calling workflows forwards all repository secrets when only COPILOT_TOKEN is needed
  • Behavioral evaluation should not gate PR merges or releases at this stage (see comments on pr-validation.yml and release-stable.yml)

Our repository enforces these standards through Test-DependencyPinning.ps1, Test-WorkflowPermissions.ps1, and the conventions documented in workflows.instructions.md. The copilot-setup-steps.yml workflow demonstrates the expected pattern for downloading and verifying external binaries.
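
For contributors who want to pre-screen a workflow for the items listed above before pushing, a rough sketch follows. It is not `Test-DependencyPinning.ps1` or `Test-WorkflowPermissions.ps1`; the function name `scan_workflow`, the finding labels, and the regular expressions are assumptions, and the checks are coarse text heuristics rather than a YAML-aware analysis.

```python
import re

SHA_SUFFIX = re.compile(r"@[0-9a-f]{40}$")


def scan_workflow(text: str) -> list[str]:
    """Return labels for the review findings that appear in a workflow's text."""
    findings = []
    # Forwarding every repository secret to a called workflow.
    if re.search(r"^[ \t]*secrets:[ \t]*inherit\b", text, re.MULTILINE):
        findings.append("secrets: inherit")
    # Unrestricted permissions flag on a CLI invocation.
    if "--allow-all" in text:
        findings.append("--allow-all")
    # Global npm install pinned only to a major version, e.g. pkg@1.
    if re.search(r"npm\s+install\s+-g\s+\S+@\d+(?=\s|$)", text):
        findings.append("npm major-version range")
    # pip install from a git URL that does not end in a full commit SHA.
    for url in re.findall(r"git\+https://\S+", text):
        if not SHA_SUFFIX.search(url):
            findings.append("pip git install without commit SHA")
            break
    # Coarse count: each checkout should have a persist-credentials: false.
    checkouts = len(re.findall(r"uses:[ \t]*actions/checkout@", text))
    hardened = len(re.findall(r"persist-credentials:[ \t]*false", text))
    if checkouts > hardened:
        findings.append("checkout without persist-credentials: false")
    return findings
```

An empty list only means none of these patterns matched; it does not replace the repository's enforced checks.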

We recommend deploying beval as a standalone workflow_dispatch or scheduled workflow instead of integrating it into pr-validation.yml and release-stable.yml. This allows behavioral testing to proceed without gating contributor workflows or release processes.

Please comment if you have questions about any of the suggestions, and we can discuss further.

Comment thread .github/workflows/beval.yml
Comment thread .github/workflows/beval.yml Outdated
Comment thread .github/workflows/beval.yml Outdated
Comment thread .github/workflows/beval.yml Outdated
Comment thread .github/workflows/pr-validation.yml Outdated
Comment thread .github/workflows/release-stable.yml Outdated
Comment thread .github/workflows/release-stable.yml Outdated
@eedorenko
Contributor Author

All comments and alerts have been addressed.

@WilliamBerryiii
Member

@eedorenko - thanks for getting this here. I'm gonna try to get this set up and running this week ... just bear with me as I try to get through some backlog of stuff.

Collaborator

@chaosdinosaur chaosdinosaur left a comment

Solid addition of behavioral evaluation for the DT Coach agent — the test cases are well-crafted and closely aligned with the agent's Think/Speak/Empower philosophy. SHA pinning on Actions and the beval install are good.

A few items to address: cspell ignore path mismatch with actual results path, missing concurrency block per repo conventions, and the personal-repo supply chain consideration for the beval dependency. Minor: cspell word ordering and lockfile noise from merge churn.

Comment thread .github/workflows/beval.yml
Comment thread .github/workflows/beval.yml
Comment thread .cspell.json
Comment thread .cspell.json
Comment thread package-lock.json
@WilliamBerryiii WilliamBerryiii changed the title from "ci: add beval behavioral evaluation workflow for dt-coach agent" to "feat(workflows): add beval behavioral evaluation workflow for dt-coach agent" Apr 23, 2026
eedorenko and others added 2 commits April 23, 2026 13:36
- Add concurrency block to beval.yml per repo conventions
- Add supply-chain context comment on beval personal-repo install
- Fix cspell ignorePaths to match actual results output path
- Sort cspell words list alphabetically
- Reset package.json and package-lock.json to main to remove merge churn

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Resolve conflicts in .cspell.json (keep both beval and behavioural,
deduplicate smol, maintain alphabetical order), take upstream versions
of package.json and package-lock.json.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@eedorenko eedorenko requested a review from chaosdinosaur April 24, 2026 00:15
6 participants