75 commits
af3f6ca
feat: add beval behavioral evaluation for dt-coach agent
eedorenko Mar 16, 2026
ef56eae
Update copilot command to use claude-opus model
eedorenko Mar 16, 2026
8faa4ea
Simplify agent startup command in beval.yml
eedorenko Mar 16, 2026
50a03dd
Modify copilot command to include prompt
eedorenko Mar 16, 2026
26fcbe7
Specify working directory for Start agent step
eedorenko Mar 16, 2026
b2feabd
fix: use init_prompt for agent activation, add identity case
eedorenko Mar 17, 2026
5a7ae11
fix: pin GitHub Actions dependencies to SHA hashes
eedorenko Mar 17, 2026
ade4c27
ci: trigger beval workflow test
eedorenko Mar 17, 2026
c708932
ci: add token debug workflow
eedorenko Mar 17, 2026
01849f7
ci: add token verification step to beval workflow
eedorenko Mar 17, 2026
de9e55e
ci: use claude-opus-4.6-fast model for agent and judge
eedorenko Mar 18, 2026
859fa91
ci: use claude-opus-4.6-1m model and add debug logging
eedorenko Mar 18, 2026
7e5afbe
ci: set AGENT_REPO_ROOT to absolute workspace path
eedorenko Mar 18, 2026
967c680
ci: temporarily run only agent_identity case
eedorenko Mar 18, 2026
3bf5071
ci: set model via ACP session instead of CLI flag
eedorenko Mar 18, 2026
b00d3f0
ci: remove token verification step and run full test suite
eedorenko Mar 18, 2026
4f1a9c2
chore: remove debug logging and agent_identity test case
eedorenko Mar 18, 2026
fcaf374
ci: install beval from default branch
eedorenko Mar 19, 2026
d662c71
Merge branch 'main' into eedorenko/beval
eedorenko Mar 19, 2026
d1c8b08
ci: fix spell check failures and workflow permissions
eedorenko Mar 19, 2026
b156cf7
ci: add beval to release pipeline and clean up debug artifacts
eedorenko Mar 19, 2026
86b028e
fix: resolve flatted prototype pollution vulnerability
eedorenko Mar 19, 2026
0b867a3
ci: temporarily run only agent identity smoke test
eedorenko Mar 19, 2026
6a61043
ci: restore full test suite after agent identity verification
eedorenko Mar 19, 2026
5a288b6
ci: pin Copilot CLI to exact version and use npm ci
eedorenko Mar 20, 2026
373b4c6
ci: pin beval install to specific commit SHA
eedorenko Mar 20, 2026
6f932cd
ci: replace --allow-all with least-privilege tool permissions
eedorenko Mar 20, 2026
a5f8c4b
ci: use explicit secret forwarding instead of secrets: inherit
eedorenko Mar 20, 2026
dc17252
ci: pin beval to fix for missing request_permission in ACPJudgeClient
eedorenko Mar 20, 2026
c343528
ci: omit permission flags from Copilot CLI ACP server
eedorenko Mar 20, 2026
3d491eb
ci: make beval non-blocking in PR and release workflows
eedorenko Mar 20, 2026
d44bab9
ci: scope COPILOT_GITHUB_TOKEN to agent and judge steps only
eedorenko Mar 20, 2026
d7e0019
ci: remove beval from PR and release pipelines
eedorenko Mar 20, 2026
d6e0e80
refactor: scope beval files under dt-coach subdirectory
eedorenko Mar 20, 2026
5ff0c54
chore: merge upstream/main
eedorenko Mar 20, 2026
5976fba
ci: fix beval SHA pin (correct full SHA for b92c200)
eedorenko Mar 20, 2026
78e32a0
test(beval): strengthen Method 8 case with role and project context
eedorenko Mar 20, 2026
93d6137
ci: update beval SHA pin to 1f01760 (fix import order)
eedorenko Mar 20, 2026
61dcbdd
Merge branch 'main' into eedorenko/beval
eedorenko Mar 20, 2026
d4e85fc
ci: allow @github/copilot packages in dependency review
eedorenko Mar 20, 2026
b743d2f
Merge branch 'main' into eedorenko/beval
eedorenko Mar 20, 2026
ca9daa1
ci: replace npm ci with exact-version global install for Copilot CLI
eedorenko Mar 20, 2026
a430a06
Merge branch 'main' into eedorenko/beval
eedorenko Mar 23, 2026
a3bf74d
ci: pin beval to main branch merge commit (a2effa1)
eedorenko Mar 23, 2026
e2b0414
Merge branch 'main' into eedorenko/beval
eedorenko Mar 23, 2026
70a0bd7
Merge branch 'main' into eedorenko/beval
eedorenko Mar 24, 2026
0e23ce4
Merge branch 'main' into eedorenko/beval
eedorenko Mar 24, 2026
172ef62
Merge branch 'main' into eedorenko/beval
eedorenko Mar 24, 2026
66fe5b1
ci: restore npm ci and exempt @github/copilot from license check
eedorenko Mar 24, 2026
30762ac
Merge branch 'main' into eedorenko/beval
eedorenko Mar 25, 2026
70d59c6
Merge branch 'main' into eedorenko/beval
eedorenko Mar 27, 2026
d02597d
Merge branch 'main' into eedorenko/beval
eedorenko Mar 30, 2026
0bdc402
Merge branch 'main' into eedorenko/beval
eedorenko Mar 30, 2026
feca948
Merge branch 'main' into eedorenko/beval
eedorenko Apr 1, 2026
41796da
Merge branch 'main' into eedorenko/beval
eedorenko Apr 1, 2026
9c256a4
Merge branch 'main' into eedorenko/beval
eedorenko Apr 2, 2026
b7035d4
fix: add missing comma in allow-dependencies-licenses list
eedorenko Apr 2, 2026
c5cb5e3
Merge branch 'main' into eedorenko/beval
eedorenko Apr 2, 2026
292f7d6
Merge branch 'main' into eedorenko/beval
eedorenko Apr 3, 2026
11fd4e7
Merge branch 'main' into eedorenko/beval
eedorenko Apr 6, 2026
eedd6cf
Merge branch 'main' into eedorenko/beval
eedorenko Apr 7, 2026
8a01cca
Merge branch 'main' into eedorenko/beval
eedorenko Apr 7, 2026
033d010
Merge branch 'main' into eedorenko/beval
eedorenko Apr 8, 2026
da126e6
Merge branch 'main' into eedorenko/beval
eedorenko Apr 9, 2026
2e96685
Merge branch 'main' into eedorenko/beval
eedorenko Apr 14, 2026
bb3a172
Merge branch 'main' into eedorenko/beval
eedorenko Apr 15, 2026
ca91e7a
Merge branch 'main' into eedorenko/beval
eedorenko Apr 17, 2026
d43c8dd
Merge branch 'main' into eedorenko/beval
eedorenko Apr 20, 2026
75b41ce
Merge branch 'main' into eedorenko/beval
eedorenko Apr 21, 2026
a39650a
Merge branch 'main' into eedorenko/beval
eedorenko Apr 22, 2026
519d4e8
fix: address review feedback from chaosdinosaur
eedorenko Apr 23, 2026
385c86e
chore: merge upstream/main and resolve conflicts
eedorenko Apr 23, 2026
36bcbd5
Merge branch 'main' into eedorenko/beval
eedorenko Apr 24, 2026
ba171ef
Merge branch 'main' into eedorenko/beval
eedorenko May 1, 2026
184e948
Merge branch 'main' into eedorenko/beval
eedorenko May 4, 2026
10 changes: 7 additions & 3 deletions .cspell.json
@@ -24,7 +24,8 @@
     "**/Cargo.lock",
     "CHANGELOG.md",
     "logs/**",
-    "docs/docusaurus/build/**"
+    "docs/docusaurus/build/**",
+    "beval/**/results/**"
   ],
   "ignoreRegExpList": [
     "/#.*/g",
@@ -62,22 +63,25 @@
     "general-technical"
   ],
   "words": [
+    "agentic",
     "atheris",
+    "behaviour",
+    "behavioural",
+    "beval",
     "brainwriting",
     "clusterfuzzlite",
     "collab",
     "easyops",
     "figjam",
     "hideable",
     "learning",
     "parseable",
     "smol",
     "subcat",
     "whiteboarding",
     "wireframes",
     "ˈpræksɪs",
-    "πρᾶξις",
-    "agentic"
+    "πρᾶξις"
   ],
   "reporters": [
     "default",
87 changes: 87 additions & 0 deletions .github/workflows/beval.yml
@@ -0,0 +1,87 @@
name: Behavioral Evaluation (beval)

on:
  workflow_call:
    secrets:
      COPILOT_TOKEN:
        required: true
  workflow_dispatch:

permissions:
  contents: read

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: false

jobs:
  evaluate:
    runs-on: ubuntu-latest
    timeout-minutes: 30

    env:
      AGENT_REPO_ROOT: ${{ github.workspace }}

    steps:
      - name: Checkout repository
        uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v4.2.2
        with:
          persist-credentials: false

      - name: Set up Python
        uses: actions/setup-python@a309ff8b426b58ec0e2a45f0f869d46889d02405 # v6.2.0
        with:
          python-version: "3.12"

      - name: Install GitHub Copilot CLI
        run: |
          npm ci --prefix beval
          echo "${{ github.workspace }}/beval/node_modules/.bin" >> "$GITHUB_PATH"

      - name: Install beval
        # beval is hosted under a personal account (vyta) while an org-owned
        # home is evaluated. The install is pinned to a specific commit SHA to
        # mitigate supply-chain risk in the interim.
        run: pip install --no-cache-dir "beval[all] @ git+https://github.com/vyta/beval.git@a2effa10cec1b06c394811587fede0070174d589#subdirectory=python"

      - name: Start agent (TCP)
        env:
          COPILOT_GITHUB_TOKEN: ${{ secrets.COPILOT_TOKEN }}
        run: |
          copilot --acp --port 3000 &
          for i in $(seq 1 30); do
            nc -z 127.0.0.1 3000 && break
            echo "Waiting for agent to start ($i)..."
            sleep 2
          done
          nc -z 127.0.0.1 3000 || { echo "Agent failed to start"; exit 1; }

      - name: Start judge (TCP)
        env:
          COPILOT_GITHUB_TOKEN: ${{ secrets.COPILOT_TOKEN }}
        run: |
          copilot --acp --port 3001 &
          for i in $(seq 1 30); do
            nc -z 127.0.0.1 3001 && break
            echo "Waiting for judge to start ($i)..."
            sleep 2
          done
          nc -z 127.0.0.1 3001 || { echo "Judge failed to start"; exit 1; }

      - name: Run evaluations
        run: |
          beval \
            -c beval/dt-coach/eval.config.yaml \
            run \
            --cases beval/dt-coach/cases/ \
            --agent beval/dt-coach/agent.yaml \
            -m validation \
            -o beval/dt-coach/results/results.json

      - name: Upload results
        uses: actions/upload-artifact@bbbca2ddaa5d8feaa63e36b76fdaad77386f024f # v4.4.3
        if: always()
        with:
          name: beval-results-${{ github.run_id }}
          path: beval/dt-coach/results/
          retention-days: 30
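
The "Start agent" and "Start judge" steps both poll a local TCP port until the Copilot CLI process accepts connections. As a rough illustration of that readiness check outside of shell, here is a minimal Python sketch. The function name `wait_for_port` and its parameters are hypothetical and not part of the workflow or of beval; the defaults mirror the 30 attempts and 2-second sleep used in the `run` blocks.

```python
import socket
import time


def wait_for_port(host: str, port: int, attempts: int = 30, delay: float = 2.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False if every attempt fails."""
    for attempt in range(1, attempts + 1):
        try:
            # Similar in spirit to `nc -z`: open a connection, then close it immediately.
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            print(f"Waiting for {host}:{port} ({attempt})...")
            if attempt < attempts:
                time.sleep(delay)
    return False
```

A caller would treat a `False` result the same way the workflow treats the final `nc -z` failure: report that the process did not start and exit non-zero.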
10 changes: 10 additions & 0 deletions .github/workflows/dependency-review.yml
@@ -55,12 +55,22 @@ jobs:
             WTFPL, LicenseRef-scancode-unicode
           # Packages with compound SPDX expressions containing GPL or MPL
           # from bundled code; distributed licenses are permissive.
+          # @github/copilot uses a non-SPDX proprietary license
+          # (LicenseRef-bad-see-license-in-license.md); it is GitHub's own
+          # CLI toolchain, deliberately used in beval.yml.
           # pkg:npm/hve-core is the private root package (never published to npm).
           allow-dependencies-licenses: >-
             pkg:pypi/lxml,
             pkg:pypi/typing-extensions,
             pkg:npm/dompurify,
             pkg:npm/lunr-languages,
+            pkg:npm/%40github/copilot,
+            pkg:npm/%40github/copilot-darwin-arm64,
+            pkg:npm/%40github/copilot-darwin-x64,
+            pkg:npm/%40github/copilot-linux-arm64,
+            pkg:npm/%40github/copilot-linux-x64,
+            pkg:npm/%40github/copilot-win32-arm64,
+            pkg:npm/%40github/copilot-win32-x64,
             pkg:npm/hve-core
           show-openssf-scorecard: true
           warn-on-openssf-scorecard-level: 3
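
The `%40` in the new `allow-dependencies-licenses` entries is the percent-encoded `@` of the scoped npm package name `@github/copilot`. A small Python sketch of how such identifiers can be derived follows; the helper name `npm_purl` is hypothetical and only illustrates the encoding, not how the dependency-review action processes the list.

```python
from urllib.parse import quote


def npm_purl(package: str) -> str:
    """Build a `pkg:npm/...` identifier for an npm package name (hypothetical helper)."""
    # quote() percent-encodes "@" as %40; "/" stays literal because it is listed in `safe`.
    return "pkg:npm/" + quote(package, safe="/")


# Platform suffixes taken from the entries added in this diff.
platforms = ["darwin-arm64", "darwin-x64", "linux-arm64", "linux-x64", "win32-arm64", "win32-x64"]
entries = [npm_purl("@github/copilot")] + [npm_purl(f"@github/copilot-{p}") for p in platforms]
print(",\n".join(entries))
```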
20 changes: 20 additions & 0 deletions beval/dt-coach/agent.yaml
@@ -0,0 +1,20 @@
name: dt-coach
description: >
  Design Thinking Coach — a conversational coaching agent that guides teams
  through the 9 Design Thinking for HVE methods using a Think/Speak/Empower
  philosophy.
protocol: acp
connection:
  transport: tcp
  host: ${AGENT_HOST:-127.0.0.1}
  port: ${AGENT_PORT:-3000}
  cwd: ${AGENT_REPO_ROOT:-.}
  model: ${AGENT_MODEL:-claude-opus-4.6-1m}
  init_prompt: "Launch .github/agents/design-thinking/dt-coach.agent.md"
timeout: 120
retry:
  max_attempts: 2
  backoff: 5.0
metadata:
  domain: design-thinking
  version: "0.1"
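
The connection values in `agent.yaml` use `${VAR:-default}` placeholders. How beval resolves them is not shown in this diff; the Python sketch below assumes shell-style semantics, where the default applies when the variable is unset or empty. The function `expand_defaults` is hypothetical and written only to illustrate that assumption.

```python
import re

# Matches ${NAME} and ${NAME:-default}.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)(?::-([^}]*))?\}")


def expand_defaults(text: str, env: dict[str, str]) -> str:
    """Expand ${NAME:-default} placeholders against `env` (assumed shell-like behavior)."""

    def substitute(match: re.Match) -> str:
        name, default = match.group(1), match.group(2)
        value = env.get(name, "")
        # Shell `:-` semantics: fall back when the variable is unset OR empty.
        return value if value else (default or "")

    return _PLACEHOLDER.sub(substitute, text)
```

Under that assumption, the workflow's `AGENT_REPO_ROOT: ${{ github.workspace }}` would override the `.` default for `cwd`, while host, port, and model would keep their defaults unless the corresponding variables are set.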
130 changes: 130 additions & 0 deletions beval/dt-coach/cases/coaching-behaviors.yaml
@@ -0,0 +1,130 @@
background:
  category: coaching-behaviors
  given:
    domain: design-thinking

cases:
  # ── Think / Speak / Empower philosophy ──────────────────────────

  - id: think_speak_empower_pattern
    name: Response follows Think/Speak/Empower structure
    tags: [philosophy, core]
    given:
      query: >
        Our team has been struggling with a legacy inventory system. Users
        keep asking for a dashboard, but we're not sure that's the real
        problem. Can you help us figure out what to do?
    stages:
      - when: the agent processes the request
        then:
          - completion time should be under: 120
      - when: the agent responds
        then:
          - response length should be: [50, 3000]
          - the answer should be: >
              shares an observation or insight conversationally (e.g. "I'm
              noticing..." or "This makes me think...") and ends with a
              choice or open question that empowers the user to decide what
              to explore next, rather than giving a directive or action plan

  - id: short_conversational_responses
    name: Keep responses concise — no methodology lectures
    tags: [conversation-style, core]
    given:
      query: >
        What is design thinking and how does it work?
    stages:
      - when: the agent processes the request
        then:
          - completion time should be under: 120
      - when: the agent responds
        then:
          - response length should be: [50, 2000]
          - the answer should be: >
              gives a brief, conversational explanation without delivering a
              long methodology lecture or listing all 9 methods in detail;
              keeps the response focused and asks what the user wants to
              work on rather than comprehensively explaining the framework

  - id: empowers_with_choices
    name: End with choices not directives
    tags: [philosophy, core]
    given:
      query: >
        We just finished interviewing 8 stakeholders. What do we do now?
    stages:
      - when: the agent processes the request
        then:
          - completion time should be under: 120
      - when: the agent responds
        then:
          - response length should be: [50, 3000]
          - the answer should be: >
              presents options or asks a question that lets the user choose
              the next step rather than issuing a single directive like
              "you should do X"; the response ends with something like
              "does that resonate?" or "want to explore that or move forward?"

  # ── Coaching boundaries ─────────────────────────────────────────

  - id: collaborate_not_execute
    name: Work WITH users, not FOR them
    tags: [boundaries, core]
    given:
      query: >
        Can you create a stakeholder map for our project? The key people
        are the VP of Operations, two plant managers, a shift supervisor,
        and the IT director.
    stages:
      - when: the agent processes the request
        then:
          - completion time should be under: 120
      - when: the agent responds
        then:
          - response length should be: [50, 3000]
          - the answer should be: >
              does NOT simply produce a finished stakeholder map; instead
              guides the user to co-create it by asking about relationships,
              influence levels, or perspectives that would make the map
              more useful

  - id: no_prescriptive_solutions
    name: Do not prescribe specific solutions to user problems
    tags: [boundaries, core]
    given:
      query: >
        Our factory floor workers are ignoring the new safety checklist app.
        Adoption is at 15%. How do we fix this?
    stages:
      - when: the agent processes the request
        then:
          - completion time should be under: 120
      - when: the agent responds
        then:
          - response length should be: [50, 3000]
          - the answer should be: >
              does NOT jump to prescribing a specific fix like "add
              gamification" or "simplify the UI"; instead helps the user
              explore WHY adoption is low by asking questions about user
              context, pain points, or assumptions that haven't been tested

  - id: never_make_users_feel_foolish
    name: Stay curious and supportive when users are confused
    tags: [boundaries, tone]
    given:
      query: >
        I don't really understand what input synthesis means. We just have
        a bunch of interview notes and I'm not sure what to do with them.
        This feels overwhelming.
    stages:
      - when: the agent processes the request
        then:
          - completion time should be under: 120
      - when: the agent responds
        then:
          - response length should be: [50, 3000]
          - the answer should be: >
              responds with empathy and curiosity, normalizing the feeling
              of being overwhelmed; does NOT lecture about synthesis
              methodology but instead offers a small, manageable starting
              point and reassures the user
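
Each case above pairs two deterministic assertions (`completion time should be under` and `response length should be`) with one judged assertion (`the answer should be`). The Python sketch below illustrates only the deterministic pair. It assumes the time limit is in seconds and the length range counts characters, inclusive at both ends; neither unit is stated in this diff, and `check_response` is a hypothetical helper, not beval's implementation.

```python
def check_response(
    response: str,
    elapsed_seconds: float,
    length_range: tuple[int, int],
    max_seconds: float,
) -> list[str]:
    """Return a list of failure messages; an empty list means both checks passed."""
    failures = []
    low, high = length_range
    # "should be under" is read here as a strict upper bound (assumption).
    if elapsed_seconds >= max_seconds:
        failures.append(f"completion time {elapsed_seconds}s is not under {max_seconds}s")
    # Length is measured in characters and the range is treated as inclusive (assumption).
    if not (low <= len(response) <= high):
        failures.append(f"response length {len(response)} is outside [{low}, {high}]")
    return failures


# Example with the limits from the short_conversational_responses case.
reply = "Happy to help. What is your team working on right now, and where does it feel stuck?"
print(check_response(reply, elapsed_seconds=12.5, length_range=(50, 2000), max_seconds=120))
```

The judged assertion has no equivalent in this sketch, since it depends on the judge model started on port 3001 in the workflow.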