feat(workflows): add beval behavioral evaluation workflow for dt-coach agent #1129
Open
eedorenko wants to merge 75 commits into microsoft:main from eedorenko:eedorenko/beval
Commits
af3f6ca (eedorenko) feat: add beval behavioral evaluation for dt-coach agent
ef56eae (eedorenko) Update copilot command to use claude-opus model
8faa4ea (eedorenko) Simplify agent startup command in beval.yml
50a03dd (eedorenko) Modify copilot command to include prompt
26fcbe7 (eedorenko) Specify working directory for Start agent step
b2feabd (eedorenko) fix: use init_prompt for agent activation, add identity case
5a7ae11 (eedorenko) fix: pin GitHub Actions dependencies to SHA hashes
ade4c27 (eedorenko) ci: trigger beval workflow test
c708932 (eedorenko) ci: add token debug workflow
01849f7 (eedorenko) ci: add token verification step to beval workflow
de9e55e (eedorenko) ci: use claude-opus-4.6-fast model for agent and judge
859fa91 (eedorenko) ci: use claude-opus-4.6-1m model and add debug logging
7e5afbe (eedorenko) ci: set AGENT_REPO_ROOT to absolute workspace path
967c680 (eedorenko) ci: temporarily run only agent_identity case
3bf5071 (eedorenko) ci: set model via ACP session instead of CLI flag
b00d3f0 (eedorenko) ci: remove token verification step and run full test suite
4f1a9c2 (eedorenko) chore: remove debug logging and agent_identity test case
fcaf374 (eedorenko) ci: install beval from default branch
d662c71 (eedorenko) Merge branch 'main' into eedorenko/beval
d1c8b08 (eedorenko) ci: fix spell check failures and workflow permissions
b156cf7 (eedorenko) ci: add beval to release pipeline and clean up debug artifacts
86b028e (eedorenko) fix: resolve flatted prototype pollution vulnerability
0b867a3 (eedorenko) ci: temporarily run only agent identity smoke test
6a61043 (eedorenko) ci: restore full test suite after agent identity verification
5a288b6 (eedorenko) ci: pin Copilot CLI to exact version and use npm ci
373b4c6 (eedorenko) ci: pin beval install to specific commit SHA
6f932cd (eedorenko) ci: replace --allow-all with least-privilege tool permissions
a5f8c4b (eedorenko) ci: use explicit secret forwarding instead of secrets: inherit
dc17252 (eedorenko) ci: pin beval to fix for missing request_permission in ACPJudgeClient
c343528 (eedorenko) ci: omit permission flags from Copilot CLI ACP server
3d491eb (eedorenko) ci: make beval non-blocking in PR and release workflows
d44bab9 (eedorenko) ci: scope COPILOT_GITHUB_TOKEN to agent and judge steps only
d7e0019 (eedorenko) ci: remove beval from PR and release pipelines
d6e0e80 (eedorenko) refactor: scope beval files under dt-coach subdirectory
5ff0c54 (eedorenko) chore: merge upstream/main
5976fba (eedorenko) ci: fix beval SHA pin (correct full SHA for b92c200)
78e32a0 (eedorenko) test(beval): strengthen Method 8 case with role and project context
93d6137 (eedorenko) ci: update beval SHA pin to 1f01760 (fix import order)
61dcbdd (eedorenko) Merge branch 'main' into eedorenko/beval
d4e85fc (eedorenko) ci: allow @github/copilot packages in dependency review
b743d2f (eedorenko) Merge branch 'main' into eedorenko/beval
ca9daa1 (eedorenko) ci: replace npm ci with exact-version global install for Copilot CLI
a430a06 (eedorenko) Merge branch 'main' into eedorenko/beval
a3bf74d (eedorenko) ci: pin beval to main branch merge commit (a2effa1)
e2b0414 (eedorenko) Merge branch 'main' into eedorenko/beval
70a0bd7 (eedorenko) Merge branch 'main' into eedorenko/beval
0e23ce4 (eedorenko) Merge branch 'main' into eedorenko/beval
172ef62 (eedorenko) Merge branch 'main' into eedorenko/beval
66fe5b1 (eedorenko) ci: restore npm ci and exempt @github/copilot from license check
30762ac (eedorenko) Merge branch 'main' into eedorenko/beval
70d59c6 (eedorenko) Merge branch 'main' into eedorenko/beval
d02597d (eedorenko) Merge branch 'main' into eedorenko/beval
0bdc402 (eedorenko) Merge branch 'main' into eedorenko/beval
feca948 (eedorenko) Merge branch 'main' into eedorenko/beval
41796da (eedorenko) Merge branch 'main' into eedorenko/beval
9c256a4 (eedorenko) Merge branch 'main' into eedorenko/beval
b7035d4 (eedorenko) fix: add missing comma in allow-dependencies-licenses list
c5cb5e3 (eedorenko) Merge branch 'main' into eedorenko/beval
292f7d6 (eedorenko) Merge branch 'main' into eedorenko/beval
11fd4e7 (eedorenko) Merge branch 'main' into eedorenko/beval
eedd6cf (eedorenko) Merge branch 'main' into eedorenko/beval
8a01cca (eedorenko) Merge branch 'main' into eedorenko/beval
033d010 (eedorenko) Merge branch 'main' into eedorenko/beval
da126e6 (eedorenko) Merge branch 'main' into eedorenko/beval
2e96685 (eedorenko) Merge branch 'main' into eedorenko/beval
bb3a172 (eedorenko) Merge branch 'main' into eedorenko/beval
ca91e7a (eedorenko) Merge branch 'main' into eedorenko/beval
d43c8dd (eedorenko) Merge branch 'main' into eedorenko/beval
75b41ce (eedorenko) Merge branch 'main' into eedorenko/beval
a39650a (eedorenko) Merge branch 'main' into eedorenko/beval
519d4e8 (eedorenko) fix: address review feedback from chaosdinosaur
385c86e (eedorenko) chore: merge upstream/main and resolve conflicts
36bcbd5 (eedorenko) Merge branch 'main' into eedorenko/beval
ba171ef (eedorenko) Merge branch 'main' into eedorenko/beval
184e948 (eedorenko) Merge branch 'main' into eedorenko/beval
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,87 @@ | ||
| name: Behavioral Evaluation (beval) | ||
| on: | ||
| workflow_call: | ||
| secrets: | ||
| COPILOT_TOKEN: | ||
| required: true | ||
| workflow_dispatch: | ||
| permissions: | ||
| contents: read | ||
eedorenko marked this conversation as resolved.
| concurrency: | ||
| group: ${{ github.workflow }}-${{ github.ref }} | ||
| cancel-in-progress: false | ||
| jobs: | ||
| evaluate: | ||
| runs-on: ubuntu-latest | ||
| timeout-minutes: 30 | ||
| env: | ||
| AGENT_REPO_ROOT: ${{ github.workspace }} | ||
| steps: | ||
| - name: Checkout repository | ||
| uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v4.2.2 | ||
eedorenko marked this conversation as resolved.
| with: | ||
| persist-credentials: false | ||
| - name: Set up Python | ||
| uses: actions/setup-python@a309ff8b426b58ec0e2a45f0f869d46889d02405 # v6.2.0 | ||
| with: | ||
| python-version: "3.12" | ||
| - name: Install GitHub Copilot CLI | ||
| run: | | ||
| npm ci --prefix beval | ||
| echo "${{ github.workspace }}/beval/node_modules/.bin" >> "$GITHUB_PATH" | ||
| - name: Install beval | ||
| # beval is hosted under a personal account (vyta) while an org-owned | ||
| # home is evaluated. The install is pinned to a specific commit SHA to | ||
| # mitigate supply-chain risk in the interim. | ||
| run: pip install --no-cache-dir "beval[all] @ git+https://github.com/vyta/beval.git@a2effa10cec1b06c394811587fede0070174d589#subdirectory=python" | ||
eedorenko marked this conversation as resolved.
| - name: Start agent (TCP) | ||
| env: | ||
| COPILOT_GITHUB_TOKEN: ${{ secrets.COPILOT_TOKEN }} | ||
| run: | | ||
| copilot --acp --port 3000 & | ||
eedorenko marked this conversation as resolved.
| for i in $(seq 1 30); do | ||
| nc -z 127.0.0.1 3000 && break | ||
| echo "Waiting for agent to start ($i)..." | ||
| sleep 2 | ||
| done | ||
| nc -z 127.0.0.1 3000 || { echo "Agent failed to start"; exit 1; } | ||
| - name: Start judge (TCP) | ||
| env: | ||
| COPILOT_GITHUB_TOKEN: ${{ secrets.COPILOT_TOKEN }} | ||
| run: | | ||
| copilot --acp --port 3001 & | ||
| for i in $(seq 1 30); do | ||
| nc -z 127.0.0.1 3001 && break | ||
| echo "Waiting for judge to start ($i)..." | ||
| sleep 2 | ||
| done | ||
| nc -z 127.0.0.1 3001 || { echo "Judge failed to start"; exit 1; } | ||
| - name: Run evaluations | ||
| run: | | ||
| beval \ | ||
| -c beval/dt-coach/eval.config.yaml \ | ||
| run \ | ||
| --cases beval/dt-coach/cases/ \ | ||
| --agent beval/dt-coach/agent.yaml \ | ||
| -m validation \ | ||
| -o beval/dt-coach/results/results.json | ||
| - name: Upload results | ||
| uses: actions/upload-artifact@bbbca2ddaa5d8feaa63e36b76fdaad77386f024f # v4.4.3 | ||
| if: always() | ||
| with: | ||
| name: beval-results-${{ github.run_id }} | ||
| path: beval/dt-coach/results/ | ||
| retention-days: 30 | ||
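
The "Start agent (TCP)" and "Start judge (TCP)" steps above repeat the same readiness loop with only the port and label changed. As a rough sketch, that loop could be expressed once as a shell function. The name `wait_for_port` and its argument order are hypothetical and not part of this PR; the `nc -z` probe, the 30 attempts, and the 2-second delay are taken from the workflow.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the PR): poll a TCP port until something
# accepts connections, mirroring the inline loops in the two start steps.
# Usage: wait_for_port HOST PORT [ATTEMPTS] [DELAY_SECONDS]
wait_for_port() {
  local host="$1" port="$2" attempts="${3:-30}" delay="${4:-2}"
  local i
  for i in $(seq 1 "$attempts"); do
    # nc -z only checks whether the port is open; it sends no data.
    if nc -z "$host" "$port" 2>/dev/null; then
      return 0
    fi
    echo "Waiting for ${host}:${port} (${i}/${attempts})..."
    sleep "$delay"
  done
  echo "Nothing listening on ${host}:${port}" >&2
  return 1
}
```

Under that assumption, each step would reduce to starting the server in the background (`copilot --acp --port 3000 &`) followed by `wait_for_port 127.0.0.1 3000 || exit 1`.
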
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,20 @@ | ||
| name: dt-coach | ||
| description: > | ||
| Design Thinking Coach — a conversational coaching agent that guides teams | ||
| through the 9 Design Thinking for HVE methods using a Think/Speak/Empower | ||
| philosophy. | ||
| protocol: acp | ||
| connection: | ||
| transport: tcp | ||
| host: ${AGENT_HOST:-127.0.0.1} | ||
| port: ${AGENT_PORT:-3000} | ||
| cwd: ${AGENT_REPO_ROOT:-.} | ||
| model: ${AGENT_MODEL:-claude-opus-4.6-1m} | ||
| init_prompt: "Launch .github/agents/design-thinking/dt-coach.agent.md" | ||
| timeout: 120 | ||
| retry: | ||
| max_attempts: 2 | ||
| backoff: 5.0 | ||
| metadata: | ||
| domain: design-thinking | ||
| version: "0.1" |
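
The `${NAME:-default}` placeholders in the connection settings above use the same notation as POSIX shell parameter expansion. Whether beval resolves them with identical semantics is an assumption here, not something this PR states. Under that assumption, the sketch below shows the values the file would yield; `resolve_agent_settings` is a hypothetical name used only for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical illustration (not part of the PR): print the connection
# settings the agent file would resolve to, ASSUMING beval follows POSIX
# ${VAR:-default} semantics. With ":-", the default applies when the
# variable is unset OR set to the empty string.
resolve_agent_settings() {
  printf 'host=%s port=%s cwd=%s model=%s\n' \
    "${AGENT_HOST:-127.0.0.1}" \
    "${AGENT_PORT:-3000}" \
    "${AGENT_REPO_ROOT:-.}" \
    "${AGENT_MODEL:-claude-opus-4.6-1m}"
}
```

In the workflow only `AGENT_REPO_ROOT` is set (to `${{ github.workspace }}`), so host, port, and model would fall back to their defaults, which matches the `copilot --acp --port 3000` agent started there.
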
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,130 @@ | ||
| background: | ||
| category: coaching-behaviors | ||
| given: | ||
| domain: design-thinking | ||
| cases: | ||
| # ── Think / Speak / Empower philosophy ────────────────────────── | ||
| - id: think_speak_empower_pattern | ||
| name: Response follows Think/Speak/Empower structure | ||
| tags: [philosophy, core] | ||
| given: | ||
| query: > | ||
| Our team has been struggling with a legacy inventory system. Users | ||
| keep asking for a dashboard, but we're not sure that's the real | ||
| problem. Can you help us figure out what to do? | ||
| stages: | ||
| - when: the agent processes the request | ||
| then: | ||
| - completion time should be under: 120 | ||
| - when: the agent responds | ||
| then: | ||
| - response length should be: [50, 3000] | ||
| - the answer should be: > | ||
| shares an observation or insight conversationally (e.g. "I'm | ||
| noticing..." or "This makes me think...") and ends with a | ||
| choice or open question that empowers the user to decide what | ||
| to explore next, rather than giving a directive or action plan | ||
| - id: short_conversational_responses | ||
| name: Keep responses concise — no methodology lectures | ||
| tags: [conversation-style, core] | ||
| given: | ||
| query: > | ||
| What is design thinking and how does it work? | ||
| stages: | ||
| - when: the agent processes the request | ||
| then: | ||
| - completion time should be under: 120 | ||
| - when: the agent responds | ||
| then: | ||
| - response length should be: [50, 2000] | ||
| - the answer should be: > | ||
| gives a brief, conversational explanation without delivering a | ||
| long methodology lecture or listing all 9 methods in detail; | ||
| keeps the response focused and asks what the user wants to | ||
| work on rather than comprehensively explaining the framework | ||
| - id: empowers_with_choices | ||
| name: End with choices not directives | ||
| tags: [philosophy, core] | ||
| given: | ||
| query: > | ||
| We just finished interviewing 8 stakeholders. What do we do now? | ||
| stages: | ||
| - when: the agent processes the request | ||
| then: | ||
| - completion time should be under: 120 | ||
| - when: the agent responds | ||
| then: | ||
| - response length should be: [50, 3000] | ||
| - the answer should be: > | ||
| presents options or asks a question that lets the user choose | ||
| the next step rather than issuing a single directive like | ||
| "you should do X"; the response ends with something like | ||
| "does that resonate?" or "want to explore that or move forward?" | ||
| # ── Coaching boundaries ───────────────────────────────────────── | ||
| - id: collaborate_not_execute | ||
| name: Work WITH users, not FOR them | ||
| tags: [boundaries, core] | ||
| given: | ||
| query: > | ||
| Can you create a stakeholder map for our project? The key people | ||
| are the VP of Operations, two plant managers, a shift supervisor, | ||
| and the IT director. | ||
| stages: | ||
| - when: the agent processes the request | ||
| then: | ||
| - completion time should be under: 120 | ||
| - when: the agent responds | ||
| then: | ||
| - response length should be: [50, 3000] | ||
| - the answer should be: > | ||
| does NOT simply produce a finished stakeholder map; instead | ||
| guides the user to co-create it by asking about relationships, | ||
| influence levels, or perspectives that would make the map | ||
| more useful | ||
| - id: no_prescriptive_solutions | ||
| name: Do not prescribe specific solutions to user problems | ||
| tags: [boundaries, core] | ||
| given: | ||
| query: > | ||
| Our factory floor workers are ignoring the new safety checklist app. | ||
| Adoption is at 15%. How do we fix this? | ||
| stages: | ||
| - when: the agent processes the request | ||
| then: | ||
| - completion time should be under: 120 | ||
| - when: the agent responds | ||
| then: | ||
| - response length should be: [50, 3000] | ||
| - the answer should be: > | ||
| does NOT jump to prescribing a specific fix like "add | ||
| gamification" or "simplify the UI"; instead helps the user | ||
| explore WHY adoption is low by asking questions about user | ||
| context, pain points, or assumptions that haven't been tested | ||
| - id: never_make_users_feel_foolish | ||
| name: Stay curious and supportive when users are confused | ||
| tags: [boundaries, tone] | ||
| given: | ||
| query: > | ||
| I don't really understand what input synthesis means. We just have | ||
| a bunch of interview notes and I'm not sure what to do with them. | ||
| This feels overwhelming. | ||
| stages: | ||
| - when: the agent processes the request | ||
| then: | ||
| - completion time should be under: 120 | ||
| - when: the agent responds | ||
| then: | ||
| - response length should be: [50, 3000] | ||
| - the answer should be: > | ||
| responds with empathy and curiosity, normalizing the feeling | ||
| of being overwhelmed; does NOT lecture about synthesis | ||
| methodology but instead offers a small, manageable starting | ||
| point and reassures the user |