Draft
38 commits
e788fec
add CFE dataset
Mar 18, 2026
e3f5b30
make github action could validate cf-dataset
Mar 18, 2026
d1c355c
update
Mar 18, 2026
27aa061
Able to eidt git diff locally
Mar 18, 2026
0b196db
Add dataset
Mar 18, 2026
0688398
fix ruff error
Mar 18, 2026
7ea76d2
Add dataset and fix test
Mar 18, 2026
8584070
Change order
Mar 18, 2026
bf728d4
add dataset until 10 th
Mar 18, 2026
9fe5044
Merge branch 'main' of https://github.com/microsoft/BC-Bench into mas…
Mar 20, 2026
9e871ed
Add dataset from no.11-20
Mar 20, 2026
e1d174c
add dataset form 21 to 30
Mar 22, 2026
a53928b
fix dataset issue
Mar 22, 2026
5d0cc30
Merge branch 'main' of https://github.com/microsoft/BC-Bench into mas…
Mar 23, 2026
cb1ad9a
Add dataset form 31-40
Mar 25, 2026
6e47d79
Merge branch 'main' into master_thesis
Jiawen-CS Mar 26, 2026
bad3fc7
Merge branch 'main' of https://github.com/microsoft/BC-Bench into mas…
Mar 29, 2026
9d0d03c
Add dataset from 41 to 45
Mar 29, 2026
96302b4
Merge branch 'main' into master_thesis
Jiawen-CS Mar 30, 2026
4ce1a90
Merge branch 'main' of https://github.com/microsoft/BC-Bench into mas…
Mar 31, 2026
4579a27
Merge branch 'master_thesis' of https://github.com/microsoft/BC-Bench…
Apr 1, 2026
277a732
Add dataset from 46-50
Apr 1, 2026
f550572
Add dataset from 51 to 55
Apr 3, 2026
e04a6f9
Add dataset form 56 to 60 and allow validate specific commit dataset
Apr 3, 2026
6be58aa
Add dataset from 61 to 70
Apr 3, 2026
ca7d551
Add dataset from 71 to 80
Apr 3, 2026
14bc8f4
Add dataset from 81 to 90
Apr 5, 2026
bc20968
Add dataset from 90-101
Apr 5, 2026
e79dbee
Update issue dataset
Apr 5, 2026
d5aa412
Add experiment for cf
Apr 6, 2026
ed07a47
Merge origin/main: align counterfactual with dataset model refactor (…
Apr 10, 2026
f5b6866
Remove 2 failed CF entries, and and split CF evaluation for GitHub Ac…
Apr 10, 2026
e5957f3
fix: add include-counterfactual: false to CI and dataset-validation w…
Apr 10, 2026
0004b72
fix ruff issue to pass eval
Apr 10, 2026
2d45f79
split dataset
Apr 10, 2026
9fc0959
add functions for read cf entry
Apr 11, 2026
3e97843
change the get entries
Apr 11, 2026
0c4c63d
use bug-fix template for cfs
Apr 11, 2026
11 changes: 10 additions & 1 deletion .github/actions/setup-bc-container/action.yml
@@ -32,7 +32,16 @@ runs:
# Mask the password in GitHub Actions logs
Write-Output "::add-mask::$password"

-"BC_CONTAINER_NAME=bcbench-$("${{ inputs.instance-id }}".Split('-')[1])" | Out-File -FilePath $env:GITHUB_ENV -Append
+# Extract numeric ticket ID from instance-id, ignoring __cf-N suffix for counterfactual entries
+# e.g. "microsoftInternal__NAV-210528__cf-1" -> "210528", "microsoft__BCApps-4699" -> "4699"
+$instanceId = "${{ inputs.instance-id }}"
+if ($instanceId -match '[A-Za-z]+-(\d+)') {
+    $ticketNumber = $Matches[1]
+} else {
+    $ticketNumber = $instanceId.Split('-')[1]
+}
+
+"BC_CONTAINER_NAME=bcbench-$ticketNumber" | Out-File -FilePath $env:GITHUB_ENV -Append
"BC_CONTAINER_USERNAME=admin" | Out-File -FilePath $env:GITHUB_ENV -Append
"BC_CONTAINER_PASSWORD=$password" | Out-File -FilePath $env:GITHUB_ENV -Append
shell: pwsh
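
For readers checking the new extraction logic outside a runner, the following is a minimal Python sketch of the same rule, not the action's code. The helper name `container_name` is hypothetical; the pattern and the two sample IDs are taken from the comments in the hunk above. Note that PowerShell's `-match` is case-insensitive by default, which makes no difference here because the character class already covers both cases.

```python
import re

# Same pattern as the PowerShell -match in the action: letters, a hyphen, then digits.
TICKET_PATTERN = re.compile(r"[A-Za-z]+-(\d+)")


def container_name(instance_id: str) -> str:
    """Hypothetical helper: derive the container name from an instance id."""
    match = TICKET_PATTERN.search(instance_id)
    if match:
        ticket_number = match.group(1)
    else:
        # Same fallback as the action: second hyphen-separated segment.
        ticket_number = instance_id.split("-")[1]
    return f"bcbench-{ticket_number}"


print(container_name("microsoftInternal__NAV-210528__cf-1"))  # bcbench-210528
print(container_name("microsoft__BCApps-4699"))  # bcbench-4699
```

Under this rule a base entry and its `__cf-N` variants resolve to the same container name, since the suffix never reaches the captured digits.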
85 changes: 85 additions & 0 deletions .github/prompts/create-counterfactual.prompt.md
@@ -0,0 +1,85 @@
---
description: "Create counterfactual (CF) dataset entries for BC-Bench. Provide the base instance_id and describe the code changes for each variant."
---

# Create Counterfactual Dataset Entries

You are helping create counterfactual (CF) entries for the BC-Bench benchmark dataset.

## Context

Read these files first to understand the workflow:
- `COUNTERFACTUAL.md` — authoring guide
- `dataset/bcbench.jsonl` — find the base entry by instance_id
- `dataset/counterfactual.jsonl` — existing CF entries (match format/key ordering)

## Input Required from User

The user will provide:
1. **Base instance_id** — e.g. `microsoftInternal__NAV-224009`
2. **CF variants** — for each variant:
- What code changes to make in `test/after/` (test modifications)
- What code changes to make in `fix/after/` (fix modifications, often unchanged)
- A short variant description
- The failure layer (`L1-syntax-representation`, `L2-execution-validation`, `L3-event-driven-paradigm`, `L4-workflow-business-logic`, `L5-toolchain-ecosystem`) — classified post-hoc, not at creation time
3. **Problem statement** — either a pre-written README path or content to generate

## Workflow (per variant)

### Step 1: Analyze the base entry
```bash
python -c "import json; [print(json.dumps(json.loads(l), indent=2)) for l in open('dataset/bcbench.jsonl') if '<BASE_ID>' in l]"
```
- Understand the patch (fix) and test_patch (test) diffs
- Read the base problem statement from `dataset/problemstatement/<instance_id>/README.md`

### Step 2: Extract workspace
```bash
uv run bcbench dataset cf-extract <base_instance_id> -o cf-<short-name>
```
- Patch-only mode creates padded files — use `Get-Content ... | Where-Object { $_.Trim() }` to view content

### Step 3: Edit the after/ files
- Apply the user's described code changes to `test/after/` and/or `fix/after/`
- If the fix needs to be **reversed** (e.g. CF removes a filter instead of adding one), swap fix/before and fix/after contents:
```powershell
$before = Get-Content "fix\before\<path>" -Raw
$after = Get-Content "fix\after\<path>" -Raw
Set-Content "fix\before\<path>" -Value $after -NoNewline
Set-Content "fix\after\<path>" -Value $before -NoNewline
```
- Verify edits with `Get-Content ... | Where-Object { $_.Trim() }`

### Step 4: Create the CF entry
```bash
uv run bcbench dataset cf-create ./cf-<short-name> \
-d "<variant description>"
```

**This command automatically handles:**
- Patch regeneration from before/after files
- `FAIL_TO_PASS` auto-detection from [Test] procedures in test patch
- `PASS_TO_PASS` auto-population from the base entry
- Canonical key ordering in counterfactual.jsonl
- Problem statement directory scaffolding (copies base README **and all image/asset files** as template)

### Step 5: Edit problem statement README
- If user provided a pre-written README, copy it to the scaffolded directory at `dataset/problemstatement/<cf_instance_id>/README.md`
- Otherwise, edit the scaffolded README to describe the variant
- **Images & assets are copied automatically** by `cf-create`. Verify with `Get-ChildItem dataset/problemstatement/<cf_instance_id>/` that all referenced images are present.

### Step 6: Verify
```bash
uv run pytest tests/test_dataset_integrity.py tests/test_counterfactual.py -q
```
Confirm all tests pass. Then briefly show the created entry's key fields.

## Key Rules
- Fix patch is usually **unchanged** from base (same bug fix, different test scenario)
- If the CF requires a **different** fix, the fix/after file should contain the CF's gold fix code
- Test patch is the primary thing that changes between variants
- **No manual key reordering needed** — cf-create handles this automatically
- **No manual PASS_TO_PASS needed** — cf-create copies from base entry automatically
- Problem statement directory naming: `<base_id>__cf-N` (double underscore + hyphen)

{{{ input }}}
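
As an illustration of the naming rule in the prompt file above (`<base_id>__cf-N`, double underscore plus hyphen), here is a small Python sketch that splits an instance id into its base id and variant number. The names `parse_instance_id`, `InstanceId`, and `problem_statement_readme` are hypothetical and are not claimed to exist in the `bcbench` package; only the directory layout `dataset/problemstatement/<instance_id>/README.md` comes from the prompt text.

```python
import re
from typing import NamedTuple, Optional

# A counterfactual id is assumed to be the base id followed by "__cf-<number>".
CF_SUFFIX = re.compile(r"^(?P<base>.+)__cf-(?P<n>\d+)$")


class InstanceId(NamedTuple):
    base_id: str
    variant: Optional[int]  # None for a base (non-counterfactual) entry


def parse_instance_id(instance_id: str) -> InstanceId:
    """Hypothetical helper: split '<base_id>__cf-N' into its two parts."""
    match = CF_SUFFIX.match(instance_id)
    if match is None:
        return InstanceId(instance_id, None)
    return InstanceId(match.group("base"), int(match.group("n")))


def problem_statement_readme(instance_id: str) -> str:
    """Path layout as described in the prompt file."""
    return f"dataset/problemstatement/{instance_id}/README.md"


cf = parse_instance_id("microsoftInternal__NAV-224009__cf-1")
print(cf.base_id, cf.variant)
print(problem_statement_readme("microsoftInternal__NAV-224009__cf-1"))
```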
1 change: 1 addition & 0 deletions .github/workflows/CI.yml
@@ -51,6 +51,7 @@ jobs:
with:
test-run: true
category: ${{ needs.select-category.outputs.category }}
include-counterfactual: false

mock-evaluation:
runs-on: ubuntu-latest
24 changes: 23 additions & 1 deletion .github/workflows/claude-evaluation.yml
@@ -23,6 +23,7 @@ on:
options:
- "bug-fix"
- "test-generation"
- "counterfactual-evaluation"
test-run:
description: "Indicate this is a test run (with few entries)"
required: false
@@ -33,6 +34,24 @@ on:
required: false
default: false
type: boolean
batch:
description: "Batch index (1-based) for splitting large datasets (0=no splitting)"
required: false
default: "0"
type: choice
options:
- "0"
- "1"
- "2"
- "3"
batch-count:
description: "Total number of batches to split into (0=no splitting)"
required: false
default: "0"
type: choice
options:
- "0"
- "3"
repeat:
description: "Number of times to run sequentially (ignored for test runs)"
required: false
@@ -58,6 +77,9 @@ jobs:
with:
test-run: ${{ inputs.test-run }}
category: ${{ inputs.category }}
include-counterfactual: false
batch: ${{ inputs.batch && fromJSON(inputs.batch) || 0 }}
batch-count: ${{ inputs.batch-count && fromJSON(inputs.batch-count) || 0 }}

evaluate-with-claude-code:
runs-on: [GitHub-BCBench]
@@ -154,4 +176,4 @@ jobs:
workflow-file: claude-evaluation.yml
repeat: ${{ inputs.repeat }}
workflow-inputs: |
-{"model": "${{ inputs.model }}", "category": "${{ inputs.category }}", "test-run": "${{ inputs.test-run }}", "al-mcp": "${{ inputs.al-mcp }}"}
+{"model": "${{ inputs.model }}", "category": "${{ inputs.category }}", "test-run": "${{ inputs.test-run }}", "al-mcp": "${{ inputs.al-mcp }}", "batch": "${{ inputs.batch }}", "batch-count": "${{ inputs.batch-count }}"}
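
The diff shows only how `batch` and `batch-count` are passed along, not how `bcbench dataset list` divides the entries. The sketch below is therefore an assumption: it shows one plausible scheme (contiguous, near-equal slices with a 1-based batch index, and `batch_count == 0` meaning no splitting, as the input descriptions state). The function name `select_batch` and the sample entries are hypothetical.

```python
from typing import List, Sequence


def select_batch(entries: Sequence[str], batch: int, batch_count: int) -> List[str]:
    """Hypothetical partitioning: return the entries that belong to one batch.

    Assumes contiguous slices; the real CLI may split differently.
    """
    if batch_count == 0:
        # 0 means "no splitting" per the workflow input description.
        return list(entries)
    if not 1 <= batch <= batch_count:
        raise ValueError(f"batch must be in 1..{batch_count}, got {batch}")
    size, remainder = divmod(len(entries), batch_count)
    # The first `remainder` batches each take one extra entry.
    start = (batch - 1) * size + min(batch - 1, remainder)
    end = start + size + (1 if batch <= remainder else 0)
    return list(entries[start:end])


entries = [f"entry-{i}" for i in range(10)]
for index in (1, 2, 3):
    print(index, select_batch(entries, index, 3))
```

With 10 entries and 3 batches this scheme yields slices of 4, 3, and 3 entries, which together cover every entry exactly once.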
24 changes: 23 additions & 1 deletion .github/workflows/copilot-evaluation.yml
@@ -31,6 +31,7 @@ on:
options:
- "bug-fix"
- "test-generation"
- "counterfactual-evaluation"
test-run:
description: "Indicate this is a test run (with few entries)"
required: false
@@ -41,6 +42,24 @@ on:
required: false
default: false
type: boolean
batch:
description: "Batch index (1-based) for splitting large datasets (0=no splitting)"
required: false
default: "0"
type: choice
options:
- "0"
- "1"
- "2"
- "3"
batch-count:
description: "Total number of batches to split into (0=no splitting)"
required: false
default: "0"
type: choice
options:
- "0"
- "3"
repeat:
description: "Number of times to run sequentially (ignored for test runs)"
required: false
@@ -66,6 +85,9 @@ jobs:
with:
test-run: ${{ inputs.test-run }}
category: ${{ inputs.category }}
include-counterfactual: false
batch: ${{ inputs.batch && fromJSON(inputs.batch) || 0 }}
batch-count: ${{ inputs.batch-count && fromJSON(inputs.batch-count) || 0 }}

evaluate-with-copilot-cli:
runs-on: [GitHub-BCBench]
@@ -168,4 +190,4 @@ jobs:
workflow-file: copilot-evaluation.yml
repeat: ${{ inputs.repeat }}
workflow-inputs: |
-{"model": "${{ inputs.model }}", "category": "${{ inputs.category }}", "test-run": "${{ inputs.test-run }}", "al-mcp": "${{ inputs.al-mcp }}"}
+{"model": "${{ inputs.model }}", "category": "${{ inputs.category }}", "test-run": "${{ inputs.test-run }}", "al-mcp": "${{ inputs.al-mcp }}", "batch": "${{ inputs.batch }}", "batch-count": "${{ inputs.batch-count }}"}
56 changes: 55 additions & 1 deletion .github/workflows/dataset-validation.yml
@@ -3,23 +3,35 @@ permissions:
contents: read

on:
push:
branches:
- master_thesis
paths:
- "dataset/**"
workflow_dispatch:
inputs:
test-run:
description: "Indicate this is a test run (with few entries)"
required: false
default: true
type: boolean
base-ref:
description: "Git ref to diff against for modified-only (e.g., HEAD~1)"
required: false
default: "origin/main"
type: string
schedule:
- cron: "0 0 * * 0"

jobs:
get-entries:
uses: ./.github/workflows/get-entries.yml
with:
-modified-only: false
+modified-only: ${{ github.event_name == 'push' }}
base-ref: ${{ inputs.base-ref || 'HEAD~1' }}
test-run: ${{ inputs.test-run || false }}
category: "bug-fix"
include-counterfactual: false

verify-build-and-tests:
runs-on: [GitHub-BCBench]
@@ -54,3 +66,45 @@ jobs:
timeout-minutes: 60
run: .\scripts\Verify-BuildAndTests.ps1 -InstanceId "${{ matrix.entry }}" -RepoPath "${{ steps.setup-env.outputs.repo_path }}"
shell: pwsh

get-cf-entries:
uses: ./.github/workflows/get-entries.yml
with:
modified-only: ${{ github.event_name == 'push' }}
base-ref: ${{ inputs.base-ref || 'HEAD~1' }}
test-run: ${{ inputs.test-run || false }}
category: "counterfactual-evaluation"

verify-counterfactual-entries:
runs-on: [GitHub-BCBench]
needs: get-cf-entries
if: needs.get-cf-entries.outputs.entries != '[]'
environment:
name: ado-read
deployment: false
permissions:
contents: read
id-token: write
name: cf-${{ matrix.entry }}
strategy:
fail-fast: false
matrix:
entry: ${{ fromJson(needs.get-cf-entries.outputs.entries) }}
steps:
- name: Checkout repository
uses: actions/checkout@v5

- name: Setup BC container
id: setup-env
timeout-minutes: 40
uses: ./.github/actions/setup-bc-container
with:
instance-id: ${{ matrix.entry }}
azure-client-id: ${{ secrets.AZURE_CLIENT_ID }}
azure-tenant-id: ${{ secrets.AZURE_TENANT_ID }}
github-token: ${{ secrets.GITHUB_TOKEN }}

- name: Run build and test verification for ${{ matrix.entry }}
timeout-minutes: 60
run: .\scripts\Verify-BuildAndTests.ps1 -InstanceId "${{ matrix.entry }}" -RepoPath "${{ steps.setup-env.outputs.repo_path }}"
shell: pwsh
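
The three `with:` expressions passed to `get-entries.yml` in this workflow resolve differently per trigger. The Python sketch below approximates them, assuming GitHub's `||` behaves like Python's `or` for empty or missing inputs; the function name `resolve_get_entries_inputs` is hypothetical. On `push` and `schedule` no dispatch inputs exist, so the fallbacks apply; on `workflow_dispatch` the declared defaults (`origin/main`, `true`) are used unless overridden.

```python
from typing import Dict, Optional, Union


def resolve_get_entries_inputs(
    event_name: str,
    base_ref: Optional[str] = None,
    test_run: Optional[bool] = None,
) -> Dict[str, Union[str, bool]]:
    """Hypothetical model of the `with:` block passed to get-entries.yml."""
    return {
        # modified-only: ${{ github.event_name == 'push' }}
        "modified-only": event_name == "push",
        # base-ref: ${{ inputs.base-ref || 'HEAD~1' }}
        "base-ref": base_ref or "HEAD~1",
        # test-run: ${{ inputs.test-run || false }}
        "test-run": test_run or False,
    }


# Push to master_thesis touching dataset/**: only entries changed since HEAD~1.
print(resolve_get_entries_inputs("push"))
# Manual dispatch with the declared defaults: a test run over all entries.
print(resolve_get_entries_inputs("workflow_dispatch", base_ref="origin/main", test_run=True))
# Weekly schedule: full validation.
print(resolve_get_entries_inputs("schedule"))
```

Because `modified-only` is true only on `push`, the `base-ref` value supplied through a manual dispatch is passed along but has no effect on entry selection in `get-entries.yml` as written.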
32 changes: 31 additions & 1 deletion .github/workflows/get-entries.yml
@@ -10,6 +10,11 @@ on:
required: false
type: boolean
default: false
base-ref:
description: Git ref to diff against when using modified-only (e.g., HEAD~1, a commit SHA)
required: false
type: string
default: origin/main
test-run:
description: Indicate this is a test run (with 2 entries)
required: false
@@ -20,6 +25,21 @@ on:
required: true
type: string
default: "bug-fix"
include-counterfactual:
description: Include counterfactual entries from counterfactual.jsonl
required: false
type: boolean
default: true
batch:
description: "Batch index (1-based) for splitting large datasets across runs"
required: false
type: number
default: 0
batch-count:
description: "Total number of batches (0 = no splitting)"
required: false
type: number
default: 0
outputs:
entries:
description: JSON array of dataset entries
@@ -45,9 +65,19 @@ jobs:
cmd="uv run bcbench dataset list --category ${{ inputs.category }} --github-output entries"

if [[ "${{ inputs.modified-only }}" == "true" ]]; then
-cmd="$cmd --modified-only"
+cmd="$cmd --modified-only --base-ref '${{ inputs.base-ref }}'"
elif [[ "${{ inputs.test-run }}" == "true" ]]; then
cmd="$cmd --test-run"
fi

if [[ "${{ inputs.include-counterfactual }}" == "false" ]]; then
cmd="$cmd --no-include-counterfactual"
fi

if [[ "${{ inputs.batch-count }}" != "0" ]]; then
cmd="$cmd --batch ${{ inputs.batch }} --batch-count ${{ inputs.batch-count }}"
fi

echo "Running: $cmd"
eval "$cmd"
echo "entries output: $(cat $GITHUB_OUTPUT)"
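
The flag-assembly logic in the step above can be restated as a Python sketch that builds an argument list instead of a string passed to `eval`. This is an illustration, not a proposed replacement: the function name `build_list_command` is hypothetical, and the flags are exactly those that appear in the shell script.

```python
import shlex
from typing import List


def build_list_command(
    category: str,
    modified_only: bool = False,
    base_ref: str = "origin/main",
    test_run: bool = False,
    include_counterfactual: bool = True,
    batch: int = 0,
    batch_count: int = 0,
) -> List[str]:
    """Hypothetical mirror of the shell logic in get-entries.yml."""
    cmd = ["uv", "run", "bcbench", "dataset", "list",
           "--category", category, "--github-output", "entries"]
    if modified_only:
        cmd += ["--modified-only", "--base-ref", base_ref]
    elif test_run:
        # As in the script, --test-run is only considered when not modified-only.
        cmd.append("--test-run")
    if not include_counterfactual:
        cmd.append("--no-include-counterfactual")
    if batch_count != 0:
        cmd += ["--batch", str(batch), "--batch-count", str(batch_count)]
    return cmd


args = build_list_command("bug-fix", modified_only=True, base_ref="HEAD~1",
                          include_counterfactual=False)
print(shlex.join(args))  # shell-quoted form, for logging only
```

Keeping the command as a list means each argument, including a user-supplied `base-ref`, stays a single token without relying on manual quoting inside an `eval` string.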
1 change: 1 addition & 0 deletions .github/workflows/mini-evaluation.yml
@@ -39,6 +39,7 @@ jobs:
with:
test-run: ${{ inputs.test-run }}
category: ${{ inputs.category }}
include-counterfactual: false

evaluate-with-mini-agent:
runs-on: [GitHub-BCBench]