5 changes: 4 additions & 1 deletion Makefile
@@ -1,4 +1,4 @@
-.PHONY: up down test lint smoke eval install-dev check
+.PHONY: up down test lint smoke eval eval-verified install-dev check

up:
docker compose up --build
@@ -26,3 +26,6 @@ smoke:

eval:
python3 scripts/eval/run_eval.py

eval-verified:
python3 scripts/eval/run_eval_verified.py
6 changes: 6 additions & 0 deletions README.md
@@ -130,10 +130,16 @@ Recommended env: `AI_PROVIDER=fake` and `DEBUG=true` (needed for citation valida
```bash
make eval
```
3. Run the verified eval harness:
```bash
make eval-verified
```

Outputs are written to:
- `scripts/eval/out/eval_results.json`
- `scripts/eval/out/eval_report.md`
- `scripts/eval/out/eval_verified_results.json`
- `scripts/eval/out/eval_verified_report.md`

## Notes

72 changes: 72 additions & 0 deletions scripts/eval/golden_verified.json
@@ -0,0 +1,72 @@
[
{
"id": "verified-001",
"question": "What is the document titled?",
"expected_behavior": "ANSWERABLE",
"must_include_keywords": ["Long-Form Content Intelligence Engine"]
},
{
"id": "verified-002",
"question": "Is this described as a sample document?",
"expected_behavior": "ANSWERABLE",
"must_include_keywords": ["sample document"]
},
{
"id": "verified-003",
"question": "What is the document used for?",
"expected_behavior": "ANSWERABLE",
"must_include_keywords": ["smoke testing"]
},
{
"id": "verified-004",
"question": "What does it describe about retrieval?",
"expected_behavior": "ANSWERABLE",
"must_include_keywords": ["retrieval"]
},
{
"id": "verified-005",
"question": "Does the text mention citations?",
"expected_behavior": "ANSWERABLE",
"must_include_keywords": ["citations"]
},
{
"id": "verified-006",
"question": "Does it mention deterministic behavior?",
"expected_behavior": "ANSWERABLE",
"must_include_keywords": ["deterministic behavior"]
},
{
"id": "verified-007",
"question": "Summarize the document in one sentence.",
"expected_behavior": "ANSWERABLE",
"must_include_keywords": ["Long-Form", "sample document"]
},
{
"id": "verified-008",
"question": "Which engine is referenced?",
"expected_behavior": "ANSWERABLE",
"must_include_keywords": ["Content Intelligence Engine"]
},
{
"id": "verified-009",
"question": "Does the document describe citations and retrieval?",
"expected_behavior": "ANSWERABLE",
"must_include_keywords": ["citations", "retrieval"]
},
{
"id": "verified-010",
"question": "What behavior is described as deterministic?",
"expected_behavior": "ANSWERABLE",
"must_include_keywords": ["deterministic behavior"]
},
{
"id": "verified-011",
"question": "Who is the current CEO of OpenAI?",
"expected_behavior": "INSUFFICIENT_EVIDENCE"
},
{
"id": "verified-012",
"question": "What is the capital of France?",
"expected_behavior": "INSUFFICIENT_EVIDENCE"
}
]
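
The diff does not include `scripts/eval/run_eval_verified.py`, so the scoring rule is not visible here. As a rough sketch of how one entry in `golden_verified.json` could be checked, assuming the harness compares a predicted behavior label against `expected_behavior` and then requires every `must_include_keywords` entry to appear in the answer: the `score_case` helper, its parameters, and the case-insensitive matching are all hypothetical and are not taken from the PR.

```python
from typing import Any


def score_case(case: dict[str, Any], answer: str, behavior: str) -> bool:
    """Return True when a response satisfies one golden entry.

    Hypothetical rule, not the PR's implementation: the behavior label must
    match exactly, and every required keyword must occur in the answer.
    """
    if behavior != case["expected_behavior"]:
        return False
    # INSUFFICIENT_EVIDENCE entries carry no keywords, so default to an empty list.
    keywords = case.get("must_include_keywords", [])
    lowered = answer.lower()
    # Assumption: matching is case-insensitive substring containment.
    return all(keyword.lower() in lowered for keyword in keywords)


# Two entries copied from golden_verified.json above.
answerable = {
    "id": "verified-009",
    "question": "Does the document describe citations and retrieval?",
    "expected_behavior": "ANSWERABLE",
    "must_include_keywords": ["citations", "retrieval"],
}
refusal = {
    "id": "verified-012",
    "question": "What is the capital of France?",
    "expected_behavior": "INSUFFICIENT_EVIDENCE",
}

print(score_case(answerable, "Yes, it covers citations and retrieval.", "ANSWERABLE"))
print(score_case(refusal, "", "INSUFFICIENT_EVIDENCE"))
```

Under this rule, the two out-of-document questions (`verified-011` and `verified-012`) pass only when the system declines to answer, which is presumably what those entries are meant to test.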