40 changes: 40 additions & 0 deletions evaluation/PBVerified/20260312_pando_gpt-5.2-codex/README.md
@@ -0,0 +1,40 @@
# pando SWE-PolyBench Verified Submission

This folder is prepared for the `AmazonScience/SWE-PolyBench_Verified` split.

Split notes:
- This package targets the verified dataset, not the full `PB` or sampled `PB500` datasets.
- The live `AmazonScience/SWE-PolyBench_Verified` dataset page shows `382` rows in the `test` split.
- The language counts shown on that page are JavaScript `100`, TypeScript `100`, Python `113`, and Java `69`, which also sum to `382`.
- The dataset card text also mentions `394` verified instances, but that conflicts with both the displayed split size and the per-language counts. This submission therefore uses `382` as the operative verified-set size.

Model:
- `gpt-5.2-codex`

Contents:
- `all_preds.jsonl`: 382 verified-set entries
- `logs/`: 100 TypeScript evaluation result files for the verified TS subset
- `trajs/`: 100 reasoning traces for the verified TS subset
- `metadata.yaml`: leaderboard metadata

Submission composition:
- Non-empty predictions are included only for the 100 verified TypeScript tasks.
- All non-TypeScript verified tasks are present in `all_preds.jsonl` with empty patches.
- The TypeScript prediction source of truth is `vscode_ts_series_run/predictions_pbv_ts.jsonl`.
- Those 100 predictions exactly match the verified TypeScript task IDs.
- For the 7 rerun instances in `ts_pbv_retry_run/`, the retry trajectories are preferred in `trajs/`.
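The composition described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the script used to build this submission: the helper names `build_all_preds` and `to_jsonl`, the `model_patch` field name, and the second task ID in the usage lines are hypothetical.

```python
import json


def build_all_preds(verified_ids, ts_predictions, patch_key="model_patch"):
    """Return one prediction row per verified task ID.

    Tasks without a prediction get an empty patch. `patch_key` is an
    assumed field name; adjust it to the schema the evaluator expects.
    """
    # Every prediction must belong to the verified set.
    unknown = set(ts_predictions) - set(verified_ids)
    if unknown:
        raise ValueError(f"predictions for non-verified IDs: {sorted(unknown)}")
    return [
        {"instance_id": iid, patch_key: ts_predictions.get(iid, "")}
        for iid in verified_ids
    ]


def to_jsonl(rows):
    """Serialize rows as JSON Lines text, one object per line."""
    return "".join(json.dumps(row) + "\n" for row in rows)


# Usage with one TypeScript task and one hypothetical non-TypeScript task.
rows = build_all_preds(
    ["coder__code-server-3277", "hypothetical__python-task-1"],
    {"coder__code-server-3277": "<patch text>"},
)
print(to_jsonl(rows), end="")
```

Keeping every verified ID in the output, with an empty patch where no prediction exists, is what makes the row count match the split size.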

Layout note:
- The public SWE-PolyBench submission README documents `evaluation/PB` and `evaluation/PB500`, but does not currently document a verified-specific submission path.
- This folder uses `evaluation/PBVerified/...` to keep the verified submission separate from the full-benchmark `PB` package.

Pass rates:
- Overall verified submission rate with empty non-TypeScript patches: `12.30% (47/382)`
- TypeScript-only verified rate: `47.00% (47/100)`
- `metadata.yaml` uses the evaluated TypeScript subset rate, following the convention used by partial verified submissions already present on the `submission` branch.
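The two rates above follow directly from the counts. A small sketch of the arithmetic, where `pass_rate` is a hypothetical helper introduced only for this example:

```python
def pass_rate(resolved, total):
    """Format a pass rate as 'PP.PP% (resolved/total)'."""
    if total <= 0:
        raise ValueError("total must be positive")
    return f"{100 * resolved / total:.2f}% ({resolved}/{total})"


print(pass_rate(47, 382))  # overall, counting empty non-TypeScript patches: 12.30% (47/382)
print(pass_rate(47, 100))  # TypeScript-only subset: 47.00% (47/100)
```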

Source artifacts used:
- `vscode_ts_series_run/predictions_pbv_ts.jsonl`
- `vscode_ts_series_run/eval/*_result.json`
- `vscode_ts_series_run/worker-*/logs/*.events.jsonl`
- `ts_pbv_retry_run/logs/*.events.jsonl`
382 changes: 382 additions & 0 deletions evaluation/PBVerified/20260312_pando_gpt-5.2-codex/all_preds.jsonl

Large diffs are not rendered by default.

@@ -0,0 +1,13 @@
{
"instance_id": "angular__angular-37561",
"patch_applied": true,
"generation": true,
"with_logs": true,
"all_f2p_passed": false,
"no_p2p_failed": true,
"resolved": false,
"passed_tests": [],
"failed_tests": [
"/packages/core/test/render3:render3"
]
}
@@ -0,0 +1,67 @@
{
"instance_id": "coder__code-server-3277",
"patch_applied": true,
"generation": true,
"with_logs": true,
"all_f2p_passed": false,
"no_p2p_failed": true,
"resolved": false,
"passed_tests": [
"/testbed/test/unit/serviceWorker.test.ts->serviceWorker should add 3 listeners: install, activate and fetch",
"/testbed/test/unit/serviceWorker.test.ts->serviceWorker should call the proper callbacks for 'install'",
"/testbed/test/unit/serviceWorker.test.ts->serviceWorker should do nothing when 'fetch' is called",
"/testbed/test/unit/serviceWorker.test.ts->serviceWorker should call the proper callbacks for 'activate'",
"/testbed/test/unit/http.test.ts->http HttpCode should return the correct HTTP codes",
"/testbed/test/unit/http.test.ts->http HttpError should work as expected",
"/testbed/test/unit/http.test.ts->http HttpError should have details if provided",
"/testbed/test/unit/constants.test.ts->constants getPackageJson should log a warning if package.json not found",
"/testbed/test/unit/constants.test.ts->constants getPackageJson should find the package.json",
"/testbed/test/unit/constants.test.ts->constants version should return the package.json version",
"/testbed/test/unit/constants.test.ts->constants commit should return 'development' if commit is undefined",
"/testbed/test/unit/constants.test.ts->test constants tmpdir should return a temp directory",
"/testbed/test/unit/emitter.test.ts->emitter should run the correct callbacks",
"/testbed/test/unit/emitter.test.ts->emitter should log an error if something goes wrong",
"/testbed/test/unit/socket.test.ts->SocketProxyProvider should work without a proxy",
"/testbed/test/unit/socket.test.ts->SocketProxyProvider should work with a proxy",
"/testbed/test/unit/socket.test.ts->SocketProxyProvider should close",
"/testbed/test/unit/update.test.ts->update should get the latest",
"/testbed/test/unit/update.test.ts->update should keep existing information",
"/testbed/test/unit/update.test.ts->update should force getting the latest",
"/testbed/test/unit/update.test.ts->update should get latest after interval passes",
"/testbed/test/unit/update.test.ts->update should check if it's the current version",
"/testbed/test/unit/update.test.ts->update should not reject if unable to fetch",
"/testbed/test/unit/register.test.ts->register when navigator and serviceWorker are defined test should have access to browser globals from beforeAll",
"/testbed/test/unit/register.test.ts->register when navigator and serviceWorker are defined should register a ServiceWorker",
"/testbed/test/unit/register.test.ts->register when navigator and serviceWorker are defined should log an error if something doesn't work",
"/testbed/test/unit/register.test.ts->register registerServiceWorker should register when options.base is undefined",
"/testbed/test/unit/register.test.ts->register registerServiceWorker should register when options.base is defined",
"/testbed/test/unit/routes/login.test.ts->login RateLimiter should allow one try ",
"/testbed/test/unit/routes/login.test.ts->login RateLimiter should pull tokens from both limiters (minute & hour)",
"/testbed/test/unit/routes/login.test.ts->login RateLimiter should not allow more than 14 tries in less than an hour",
"/testbed/test/unit/cli.test.ts->parser should parse nothing",
"/testbed/test/unit/cli.test.ts->parser should parse all available options",
"/testbed/test/unit/cli.test.ts->parser should work with short options",
"/testbed/test/unit/cli.test.ts->parser should use log level env var",
"/testbed/test/unit/cli.test.ts->parser should prefer --log to env var and --verbose to --log",
"/testbed/test/unit/cli.test.ts->parser should ignore invalid log level env var",
"/testbed/test/unit/cli.test.ts->parser should error if value isn't provided",
"/testbed/test/unit/cli.test.ts->parser should error if value is invalid",
"/testbed/test/unit/cli.test.ts->parser should error if the option doesn't exist",
"/testbed/test/unit/cli.test.ts->parser should not error if the value is optional",
"/testbed/test/unit/cli.test.ts->parser should not allow option-like values",
"/testbed/test/unit/cli.test.ts->parser should allow positional arguments before options",
"/testbed/test/unit/cli.test.ts->parser should support repeatable flags",
"/testbed/test/unit/cli.test.ts->parser should enforce cert-key with cert value or otherwise generate one",
"/testbed/test/unit/cli.test.ts->parser should override with --link",
"/testbed/test/unit/cli.test.ts->parser should use env var password",
"/testbed/test/unit/cli.test.ts->parser should use env var hashed password",
"/testbed/test/unit/cli.test.ts->parser should filter proxy domains",
"/testbed/test/unit/cli.test.ts->cli should use existing if inside code-server",
"/testbed/test/unit/cli.test.ts->cli should use existing if --reuse-window is set",
"/testbed/test/unit/cli.test.ts->cli should use existing if --new-window is set",
"/testbed/test/unit/cli.test.ts->cli should use existing if no unrelated flags are set, has positional, and socket is active"
],
"failed_tests": [
"/testbed/test/unit/register.test.ts->register when navigator and serviceWorker are NOT defined should log an error"
]
}
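The result records above can be summarized with a short script. This is a sketch under an assumption: it reads `resolved` as equivalent to `all_f2p_passed and no_p2p_failed` (fail-to-pass tests all pass, no pass-to-pass test fails), which the records shown here are consistent with but do not prove. `summarize_result` is a hypothetical helper, not part of the evaluation harness.

```python
import json


def summarize_result(record):
    """Summarize one evaluation result record.

    Assumption: `resolved` is expected to equal
    `all_f2p_passed and no_p2p_failed`; `consistent` reports whether
    the record agrees with that reading.
    """
    expected = bool(record["all_f2p_passed"] and record["no_p2p_failed"])
    return {
        "instance_id": record["instance_id"],
        "resolved": record["resolved"],
        "consistent": record["resolved"] == expected,
        "passed": len(record["passed_tests"]),
        "failed": len(record["failed_tests"]),
    }


# The first record shown in this diff, parsed from its JSON text.
example = json.loads("""
{
  "instance_id": "angular__angular-37561",
  "patch_applied": true,
  "generation": true,
  "with_logs": true,
  "all_f2p_passed": false,
  "no_p2p_failed": true,
  "resolved": false,
  "passed_tests": [],
  "failed_tests": ["/packages/core/test/render3:render3"]
}
""")
print(summarize_result(example))
```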