Merged

dev #138

166 commits
e543499
model names
pelikhan Jun 2, 2025
91d8a5f
Multiple eval models (#139)
bzorn Jun 4, 2025
7fe2a52
✨ Enabling separation of test gen, test runs, and test run evaluation…
bzorn Jun 4, 2025
e5fad83
removed pull
pelikhan Jun 5, 2025
796b27c
getting started on github models support
pelikhan Jun 5, 2025
458873f
✨ Add scripts and logic for multi-stage sample evaluations (#143)
bzorn Jun 5, 2025
0057f5b
passing test data
pelikhan Jun 5, 2025
eb31c1d
✨ Add support for groundtruth model and outputs
bzorn Jun 5, 2025
44cf117
upgrade deps
pelikhan Jun 5, 2025
2152903
migrate to node v22
pelikhan Jun 5, 2025
bb597eb
wiring up action
pelikhan Jun 5, 2025
dc5c4e4
add files argument
pelikhan Jun 5, 2025
7d5ab36
✨ feat: add groundtruth fields to test data pipeline
bzorn Jun 5, 2025
f425a16
use promt for noe
pelikhan Jun 5, 2025
db6d3ea
define action
pelikhan Jun 5, 2025
3e3d469
Merge remote-tracking branch 'origin/main' into action
pelikhan Jun 5, 2025
a60970b
fid build
pelikhan Jun 5, 2025
9d4c99d
fix test
pelikhan Jun 5, 2025
3c4c7c8
Merge pull request #147 from microsoft/action
pelikhan Jun 6, 2025
e0aedb2
fix build
pelikhan Jun 6, 2025
3af89be
Merge remote-tracking branch 'origin/dev' into add-ground-truth
pelikhan Jun 6, 2025
0b4440b
cleanup
pelikhan Jun 6, 2025
1ebed36
✨ Enhance model suggestions and test config options
pelikhan Jun 6, 2025
2d400f7
integrate groundtruth in run test
pelikhan Jun 6, 2025
1c18cd0
Merge pull request #146 from microsoft/add-ground-truth
pelikhan Jun 6, 2025
5a1437d
Merge remote-tracking branch 'origin/dev' into githubmodels
pelikhan Jun 6, 2025
1b8ce71
Create genai-issue-labeller
pelikhan Jun 6, 2025
cfe006f
Merge pull request #150 from microsoft/pelikhan-patch-1
pelikhan Jun 6, 2025
4e57915
Merge remote-tracking branch 'origin/dev' into githubmodels
pelikhan Jun 6, 2025
fa1daad
updated typenames
pelikhan Jun 6, 2025
b393184
import github models
pelikhan Jun 6, 2025
da90957
✨ add groundtruth-based scoring and evaluation prompt (#151)
bzorn Jun 6, 2025
8f3f536
🔗 docs: update overview link to correct destination
bzorn Jun 6, 2025
dc4dbb2
Merge remote-tracking branch 'origin/dev' into githubmodels
pelikhan Jun 9, 2025
3390e75
added prompt
pelikhan Jun 9, 2025
0d35c0e
merge
pelikhan Jun 9, 2025
ce1099d
♻️ refactor: Simplify evalModel assignment logic
pelikhan Jun 9, 2025
a00499f
:sparkles: Add gitmoji support and improve description clarity
pelikhan Jun 9, 2025
ea43136
convert on intake
pelikhan Jun 9, 2025
49e7e00
add imported field
pelikhan Jun 9, 2025
2bc678b
fix
pelikhan Jun 9, 2025
64aae36
rename openai evals
pelikhan Jun 9, 2025
5fd14a6
✨ Add initial integration for GitHub Models eval runs
pelikhan Jun 9, 2025
c484c79
✨: Persist generated eval prompt YAML for each run
pelikhan Jun 9, 2025
7d3b63d
add models under test
pelikhan Jun 9, 2025
9e1d0b9
🎨 refactor model handling and improve prompt logging
pelikhan Jun 9, 2025
556ca02
✨: Await toModelsPrompt call in githubModelsEvalsGenerate
pelikhan Jun 9, 2025
a5e0229
Merge pull request #148 from microsoft/githubmodels
pelikhan Jun 9, 2025
b683f89
♻️ Refactor test filtering logic to use filteredTests
pelikhan Jun 9, 2025
25727f9
new schema
pelikhan Jun 9, 2025
aadebcf
📝 docs: Explain --filterTestCount effect, show output files
bzorn Jun 9, 2025
03b1214
concurrency build for actions
pelikhan Jun 9, 2025
8b65c6b
Merge pull request #153 from microsoft/documentation-updates
pelikhan Jun 9, 2025
89c5517
tweaks
pelikhan Jun 9, 2025
c2d040b
v22
pelikhan Jun 9, 2025
3a472a6
Merge branch 'dev' of https://github.com/microsoft/promptpex into dev
pelikhan Jun 9, 2025
99a35aa
moving to lts-alpine
pelikhan Jun 9, 2025
fd930ef
reorg docs
pelikhan Jun 9, 2025
0b2ea82
ordering
pelikhan Jun 9, 2025
a7e58f0
gh action test
pelikhan Jun 9, 2025
17bd3ae
typo
pelikhan Jun 9, 2025
323237d
Promptpex-refactor (#155)
pelikhan Jun 9, 2025
86fc03e
updated options parsing
pelikhan Jun 9, 2025
b34ea2b
export evals after groundtruth
pelikhan Jun 9, 2025
a531315
Refactor updates (#156)
bzorn Jun 10, 2025
ea37a66
fix extension
pelikhan Jun 10, 2025
089ed3b
line intent, rules, inputSpec in openai evals
pelikhan Jun 10, 2025
3dd1560
better metric generation
pelikhan Jun 10, 2025
574df28
don't use scoring models
pelikhan Jun 10, 2025
711a33d
missing scorer
pelikhan Jun 10, 2025
6aa5019
filter with gihub tag
pelikhan Jun 10, 2025
319c48c
refactor: export scoring output formats and add GitHub models output …
pelikhan Jun 10, 2025
3c9ab4a
auto-generation of suite
pelikhan Jun 10, 2025
15de6eb
added github models eval
pelikhan Jun 10, 2025
4374e57
effort medium
pelikhan Jun 10, 2025
c9c6494
add more github models demos
pelikhan Jun 10, 2025
b627361
escape template when inline prompt
pelikhan Jun 10, 2025
975c0f2
github models docs
pelikhan Jun 10, 2025
b4feed0
reduce default noise
pelikhan Jun 10, 2025
1a07680
add summarizer example
pelikhan Jun 10, 2025
1d8b3d0
✨ Refine output tables by omitting empty or undefined values
pelikhan Jun 10, 2025
c999717
retain original prompt
pelikhan Jun 10, 2025
dbab722
limit table sizes
pelikhan Jun 10, 2025
d7f8909
remove extra file
pelikhan Jun 10, 2025
ba51f37
reorganizing prompts
pelikhan Jun 10, 2025
c7f3eab
don't lose original test samples
pelikhan Jun 10, 2025
76bfd4e
✨: Add output fence for github models eval command
pelikhan Jun 10, 2025
0a27811
✨: Expand metric file discovery across directories
pelikhan Jun 10, 2025
3e2e847
implementing cli package
pelikhan Jun 10, 2025
67c83ec
adding bin script
pelikhan Jun 10, 2025
9b4d158
add a few options
pelikhan Jun 10, 2025
a29022a
add basic run/configure cli
pelikhan Jun 10, 2025
4376053
fix typo in config
pelikhan Jun 10, 2025
160d53c
:building_construction: Standardize rules model naming across codebase
pelikhan Jun 10, 2025
7618876
fix docs
pelikhan Jun 10, 2025
cb58f19
updated keywords
pelikhan Jun 11, 2025
b59a57c
💡 improve: Display usage/help when no prompts input
pelikhan Jun 11, 2025
6e48a13
0.0.8
pelikhan Jun 11, 2025
2f118e3
✨ feat: Add support for configuring ground truth eval models (#159)
bzorn Jun 11, 2025
22317e0
use node runner
pelikhan Jun 11, 2025
aa81904
Merge branch 'dev' of https://github.com/microsoft/promptpex into dev
pelikhan Jun 11, 2025
1cdacaf
add docker info
pelikhan Jun 11, 2025
bdfa393
set default aliases
pelikhan Jun 11, 2025
a5fe1f9
docs
pelikhan Jun 11, 2025
05c2c0d
✨ Enhance groundtruth evaluation with multiple eval models (#162)
bzorn Jun 12, 2025
673460f
✨ feat: Enable separate groundtruth metric evaluation path (#163)
bzorn Jun 13, 2025
8d73129
📝 docs: Add comprehensive Test Groundtruth reference docs (#164)
bzorn Jun 13, 2025
2ee522a
always generate a github-models output
pelikhan Jun 16, 2025
79931cb
store reasoning in reports, generated tests
pelikhan Jun 16, 2025
320979b
action test
pelikhan Jun 17, 2025
fe3573d
removed duplicate ci
pelikhan Jun 17, 2025
7a08e9e
set env on tests
pelikhan Jun 17, 2025
f624c94
add test dependency
pelikhan Jun 17, 2025
2ee5857
upgrade version
pelikhan Jun 17, 2025
9027032
node v22
pelikhan Jun 17, 2025
26f81f3
remove a few workflows
pelikhan Jun 17, 2025
16a2cde
debug log
pelikhan Jun 17, 2025
982c342
merge workflows
pelikhan Jun 17, 2025
8f257e6
update test script to specify model for consistency
pelikhan Jun 17, 2025
b4d9f2b
model type
pelikhan Jun 17, 2025
d5f6097
update permissions to include models read access
pelikhan Jun 17, 2025
5754b73
lower noise
pelikhan Jun 17, 2025
6a7f0ba
smaller model for testing
pelikhan Jun 17, 2025
45db082
full path to prompt
pelikhan Jun 17, 2025
d55917d
cache
pelikhan Jun 17, 2025
57482e1
fix: handle undefined originalPrompt in toModelsPrompt function
pelikhan Jun 17, 2025
8966fb3
per workflow cache
pelikhan Jun 17, 2025
5a4d189
fix: update release.sh file permissions to executable
pelikhan Jun 17, 2025
1fdb100
update release
pelikhan Jun 17, 2025
f4c481c
fix: update dependencies for genaiscript and openai to latest versions
pelikhan Jun 17, 2025
ea67aa6
chore: bump version to 0.0.11
pelikhan Jun 17, 2025
c4efacb
✨: Add groundtruthScore to test results and improve logging (#166)
bzorn Jun 17, 2025
013b027
Update results (#168)
bzorn Jun 19, 2025
e71173d
feat: update groundtruth documentation and bump genaiscript version t…
pelikhan Jun 19, 2025
6a00094
feat: add Groundtruth card to documentation for expected output gener…
pelikhan Jun 19, 2025
d186f93
support for authors
pelikhan Jun 19, 2025
c775b1e
feat: allow 'serve' command in addition to 'configure' for genaiscrip…
pelikhan Jun 19, 2025
f6460cc
feat: add support for 'serve' command in genaiscript execution and up…
pelikhan Jun 19, 2025
4cd6471
better error message on missing run
pelikhan Jun 19, 2025
1f633e2
refactor: remove unused renderEvaluation and renderEvaluationOutcome …
pelikhan Jun 19, 2025
78d9f5c
feat: implement parseStrings function for flexible string parsing
pelikhan Jun 19, 2025
fb74bcd
use .bin path
pelikhan Jun 19, 2025
72ddf67
feat: add front matter schema for prompty definition and update groun…
pelikhan Jun 19, 2025
f014ad8
refactor: remove npm test step from CI workflow
pelikhan Jun 19, 2025
77f3afa
chore: bump version to 0.0.12
pelikhan Jun 19, 2025
5714aaf
📝 docs: Add and clarify Ground Truth test terminology (#172)
bzorn Jun 19, 2025
89916ad
feat: add GenAI Pull Request Descriptor workflow
pelikhan Jun 19, 2025
a5f880a
Groundtruth flag (#173)
pelikhan Jun 19, 2025
3fe8cd7
refactor: update groundtruth handling and improve code clarity
pelikhan Jun 19, 2025
dfe7942
refactor: improve code formatting and enhance groundtruth handling
pelikhan Jun 19, 2025
b252280
more options work
pelikhan Jun 19, 2025
109a56a
fix: update groundtruth assignment in evaluateTestMetric function
pelikhan Jun 20, 2025
e1f4af3
Planner (#176)
pelikhan Jun 20, 2025
9ab58a2
✨ Label tests with unique IDs and propagate testuid (#175)
bzorn Jun 20, 2025
7cbe0b3
refactor: update documentation structure and add implementation plan
pelikhan Jun 20, 2025
46d6571
fix: update implementation plan link in documentation
pelikhan Jun 20, 2025
3537e2e
✨ add groundtruth test output support to promptpex (#180)
bzorn Jun 20, 2025
cfaf11b
Update dependencies and scripts in package.json
pelikhan Jun 25, 2025
548319b
Merge branch 'dev' of https://github.com/microsoft/promptpex into dev
pelikhan Jun 25, 2025
3373562
chore: update genaiscript dependency to version 2.0.9
pelikhan Jun 25, 2025
fac638b
upgrade to genaiscript 2.0
pelikhan Jun 30, 2025
51db6a0
chore: bump version to 0.0.13
pelikhan Jun 30, 2025
9c555ca
new docker
pelikhan Jul 1, 2025
e7fa425
chore: add Dockerfile for serving promptpex
pelikhan Jul 1, 2025
3b23128
chore: add Dockerfile for serving genaiscript action
pelikhan Jul 1, 2025
7bc4c54
chore: update Dockerfile and package.json to expose port 8003 for ser…
pelikhan Jul 1, 2025
2 changes: 1 addition & 1 deletion .devcontainer/devcontainer.json
@@ -1,5 +1,5 @@
 {
-  "image": "mcr.microsoft.com/devcontainers/javascript-node:20",
+  "image": "mcr.microsoft.com/devcontainers/javascript-node:22",
   "features": {
     "ghcr.io/devcontainers/features/common-utils:2": {},
     "ghcr.io/devcontainers/features/git:1": {},
18 changes: 18 additions & 0 deletions .env.ollama
@@ -0,0 +1,18 @@
GENAISCRIPT_MODEL_LARGE="ollama:llama3.3"
GENAISCRIPT_MODEL_SMALL="ollama:qwen2.5:3b"
GENAISCRIPT_MODEL_TINY="ollama:llama3.2:1b"
GENAISCRIPT_MODEL_VISION="azure:gpt-4o_2024-11-20"
GENAISCRIPT_MODEL_VISION_SMALL="azure:gpt-4o-mini_2024-11-20"
GENAISCRIPT_MODEL_REASONING="azure:o1_2024-12-17"
GENAISCRIPT_MODEL_REASONING_SMALL="azure:o3-mini_2025-01-31"
GENAISCRIPT_MODEL_IMAGE="azure:dall-e-3_30"
GENAISCRIPT_MODEL_TRANSCRIPTION="azure:whisper_001"
GENAISCRIPT_MODEL_EMBEDDINGS="azure:text-embedding-ada-002_2"
GENAISCRIPT_MODEL_EVAL1="azure:gpt-4o_2024-11-20"
GENAISCRIPT_MODEL_EVAL="ollama:llama3.3"
GENAISCRIPT_MODEL_RULES="ollama:llama3.3"
GENAISCRIPT_MODEL_RULES1="ollama:llama3.3"
GENAISCRIPT_MODEL_BASELINE="ollama:llama3.3"

# use this command to login daily
# az login --scope api://trapi/.default
10 changes: 5 additions & 5 deletions .github.env
@@ -1,5 +1,5 @@
-GENAISCRIPT_MODEL_LARGE="github:gpt-4o"
-GENAISCRIPT_MODEL_EVAL="github:gpt-4o"
-GENAISCRIPT_MODEL_RULES="github:gpt-4o"
-GENAISCRIPT_MODEL_BASELINE="github:gpt-4o"
-PROMPTPEX_MODELS="github:gpt-4o-mini;github:phi-4-mini-instruct"
+GENAISCRIPT_MODEL_LARGE="github:openai/gpt-4o"
+GENAISCRIPT_MODEL_EVAL="github:openai/gpt-4o"
+GENAISCRIPT_MODEL_RULES="github:openai/gpt-4o"
+GENAISCRIPT_MODEL_BASELINE="github:openai/gpt-4o"
+PROMPTPEX_MODELS="github:openai/gpt-4o-mini;github:microsoft/phi-4-mini-instruct"
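
The identifiers in this file have a `provider:publisher/model` shape, and `PROMPTPEX_MODELS` packs several of them into one semicolon-separated value. Below is a minimal Python sketch of how such a value could be split. It assumes the first colon separates the provider from the rest; `ModelId`, `parse_model_id` and `parse_model_list` are hypothetical names for illustration, not part of PromptPex or GenAIScript.

```python
from typing import NamedTuple


class ModelId(NamedTuple):
    provider: str  # e.g. "github" or "ollama"
    model: str     # everything after the first colon, e.g. "openai/gpt-4o-mini"


def parse_model_id(value: str) -> ModelId:
    """Split 'provider:model' on the first colon only (an assumption of this sketch)."""
    provider, sep, model = value.strip().partition(":")
    if not sep or not provider or not model:
        raise ValueError(f"expected 'provider:model', got {value!r}")
    return ModelId(provider, model)


def parse_model_list(value: str) -> list[ModelId]:
    """Parse a semicolon-separated list of identifiers, skipping empty entries."""
    return [parse_model_id(part) for part in value.split(";") if part.strip()]


models = parse_model_list(
    "github:openai/gpt-4o-mini;github:microsoft/phi-4-mini-instruct"
)
```

Splitting on the first colon only keeps tags such as `ollama:qwen2.5:3b` intact, since the model part may itself contain a colon.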
81 changes: 81 additions & 0 deletions .github/instructions/implementation.instructions.md
@@ -0,0 +1,81 @@
# Implementation Plan

Hi there fellow Agent,

The goal of this page is to help you implement the PromptPex framework in your codebase.
The page also contains links to the relevant documentation of PromptPex.
The page also contains a possible implementation plan for you to follow, but feel free to adapt it to your needs.

Your task is to implement the PromptPex framework in your codebase according to the requirements of the user. Follow the user's instructions to choose which language/framework to use and how to implement the framework in your codebase.

## IMPORTANT

When implementing PromptPex, it is very important to reuse the existing prompts as is. Do not change the prompts unless you need to adapt the syntax for your template engine.

We strongly recommend keeping the prompts in separate files and loading them
at runtime, rather than hardcoding them in your codebase.

Whenever possible, use the original `.prompty` files from the `src/prompts` directory.
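
A minimal Python sketch of the "load at runtime" recommendation above. It assumes the prompts live as `*.prompty` files in a single directory; the function name `load_prompts` is hypothetical. Files are returned as raw text and not modified, in keeping with the advice to reuse the prompts as is.

```python
from pathlib import Path


def load_prompts(directory: str) -> dict[str, str]:
    """Return a mapping of prompt name (file stem) to the raw file text."""
    prompts: dict[str, str] = {}
    for path in sorted(Path(directory).glob("*.prompty")):
        # Read the file verbatim; any template adaptation happens elsewhere.
        prompts[path.stem] = path.read_text(encoding="utf-8")
    return prompts
```

A caller would typically invoke this once at startup, for example `load_prompts("src/prompts")`, and look prompts up by file stem afterwards.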

## Phase 1: Test Generation

PromptPex is a test generation framework for prompts. It is made of a graph of LLM transformations that eventually generate a set of
inputs and expected outputs for a given prompt.

- The core of the framework is documented in [Test Generation](docs/src/content/docs/reference/test-generation.md).

## Phase 2: Validate Test Generation

Once you have implemented the test generation, you should validate it on a prompt.

- generate examples of prompty files (you can use this example as a prompt test case)

```markdown
---
tags:
- unlisted
---

system:
You need to categorize a joke as funny or not.
Respond with "funny" or "not funny".

user:
{{joke}}
```

- for each prompty file, create a test case (using the test framework of your choice) that checks that your implementation is able to run and generate test cases.

- run the test cases and fix the code until all tests pass.
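
The validation steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: `parse_prompty` is a hypothetical helper that splits the front matter from the role sections, the front matter is left unparsed, and each role is assumed to appear at most once. The test checks parsing only; the call into your actual test generation implementation is omitted.

```python
import re


def parse_prompty(text: str) -> tuple[str, dict[str, str]]:
    """Split a .prompty document into raw front matter and role -> message text."""
    front_matter = ""
    body = text
    match = re.match(r"^---\s*\n(.*?)\n---\s*\n", text, flags=re.DOTALL)
    if match:
        front_matter = match.group(1)
        body = text[match.end():]

    messages: dict[str, str] = {}
    role = None
    for line in body.splitlines():
        header = re.fullmatch(r"(system|user|assistant):\s*", line)
        if header:
            # A line such as "system:" starts a new role section.
            role = header.group(1)
            messages[role] = ""
        elif role is not None:
            messages[role] += line + "\n"
    return front_matter, {r: t.strip() for r, t in messages.items()}


JOKE_PROMPT = """---
tags:
- unlisted
---

system:
You need to categorize a joke as funny or not.
Respond with "funny" or "not funny".

user:
{{joke}}
"""


def test_joke_prompt_parses():
    front_matter, messages = parse_prompty(JOKE_PROMPT)
    assert "unlisted" in front_matter
    assert messages["user"] == "{{joke}}"
    assert messages["system"].startswith("You need to categorize")
```

A test runner such as pytest would pick up `test_joke_prompt_parses` automatically; a fuller test would pass the parsed prompt to the generation pipeline and assert that at least one test case comes back.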

## Phase 3: Additional features

You will find additional features for promptpex generation that can be useful to add once the basic test generation is working.

- [Groundtruth](docs/src/content/docs/reference/groundtruth.md): how to generate groundtruth for prompts.
- [Test Samples](docs/src/content/docs/reference/test-samples.md): how to integrate existing test samples into the test generation process.
- [Test Expansion](docs/src/content/docs/reference/test-expansion.md): how to morph tests into more complex longer texts.
- [Scenarios](docs/src/content/docs/reference/scenarios.md): how to receive a custom set of input instructions from users and use them to guide the generation of tests.

### Notes

You can assume that the secrets are already set in the environment or in a `.env` file
that can be loaded using a library.
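
As an illustration of the note above, here is a minimal standard-library Python sketch of loading `KEY="value"` pairs from `.env` text. The function `load_env_file` is hypothetical, and it handles only simple lines of the kind seen in this repository's env files; a dedicated library such as python-dotenv covers far more cases and would normally be preferred.

```python
import os
from typing import MutableMapping, Optional


def load_env_file(
    text: str, environ: Optional[MutableMapping[str, str]] = None
) -> MutableMapping[str, str]:
    """Parse KEY="value" lines, skipping blank lines and '#' comments.

    Variables that are already set win, so secrets provided by the real
    environment are never overwritten by the file.
    """
    target = os.environ if environ is None else environ
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip()
        # Drop one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        target.setdefault(key.strip(), value)
    return target
```

Passing a plain dictionary as `environ` keeps the loader easy to test without touching the process environment.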

## Reference

You can read the following pages to understand the PromptPex framework and how to use it in your codebase:

- [Glossary](docs/src/content/docs/reference/glossary.md): A glossary of terms used in the PromptPex framework.
- [Test Generation](docs/src/content/docs/reference/test-generation.md): The core of the framework, how to generate tests for prompts.
- The prompts are `.prompty` files in the [prompts directory](src/prompts).
- The **.prompty** format is documented in [Prompt Format](docs/src/content/docs/reference/prompt-format.md).

## Reference implementation

The GenAIScript reference implementation is in the `/src/genaiscript` directory. PromptPex starts in `src/genaiscript/src/promptpex.mts`.

It is implemented using [GenAIScript](https://microsoft.github.io/genaiscript/).

**Follow the patterns and habits of the target framework/language you are generating.**
The reference implementation is a good starting point but you should adapt it to the target framework/language you are generating.
55 changes: 55 additions & 0 deletions .github/workflows/action.yml
@@ -0,0 +1,55 @@
name: Action Continuous Integration
on:
  workflow_dispatch:
  push:
    branches:
      - dev
permissions:
  contents: read
  models: read
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: "22"
          cache: npm
      # Cache the generated model requests made by GenAIScript
      #
      # A new cache is created for each run to ensure that the latest model requests are used,
      # but previous caches can be restored and reused if available.
      - uses: actions/cache@v4
        with:
          path: .genaiscript/cache/**
          key: genaiscript-${{ github.workflow }}-${{ github.run_id }}
          restore-keys: |
            genaiscript-
      - run: npm ci
      - run: npm run build
  test-action:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Cache the generated model requests made by GenAIScript
      #
      # A new cache is created for each run to ensure that the latest model requests are used,
      # but previous caches can be restored and reused if available.
      - uses: actions/cache@v4
        with:
          path: .genaiscript/cache/**
          key: genaiscript-${{ github.workflow }}-${{ github.run_id }}
          restore-keys: |
            genaiscript-
      - uses: ./
        with:
          prompt: |
            system:
            Is this joke funny?
            user:
            {{ input }}
          effort: min
          github_token: ${{ secrets.GITHUB_TOKEN }}
          debug: "script"
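
The cache comments in this workflow rely on a key that is unique per run plus a `restore-keys` prefix fallback. The Python sketch below models that lookup in a simplified way: an exact key hit first, then the newest cache whose key starts with the key or with one of the restore prefixes. It deliberately ignores branch scoping and cache versions, and `pick_cache` is a hypothetical name, not part of `actions/cache`.

```python
from typing import Optional


def pick_cache(
    primary_key: str, restore_keys: list[str], existing: dict[str, int]
) -> Optional[str]:
    """Return the cache key that would be restored, or None on a miss.

    `existing` maps each stored cache key to its creation time
    (any sortable number; larger means newer).
    """
    if primary_key in existing:
        return primary_key
    for prefix in [primary_key, *restore_keys]:
        matches = [k for k in existing if k.startswith(prefix)]
        if matches:
            # Among partial matches, prefer the most recently created cache.
            return max(matches, key=lambda k: existing[k])
    return None


stored = {
    "genaiscript-Action Continuous Integration-101": 1,
    "genaiscript-Action Continuous Integration-102": 2,
}
restored = pick_cache(
    "genaiscript-Action Continuous Integration-103", ["genaiscript-"], stored
)
```

Because `github.run_id` changes on every run, the exact key never hits in this model, so each run restores the newest earlier cache through the `genaiscript-` prefix and then saves a fresh one under its own key.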
23 changes: 0 additions & 23 deletions .github/workflows/build.yml

This file was deleted.

5 changes: 4 additions & 1 deletion .github/workflows/docs.yml
@@ -7,6 +7,9 @@ on:
       - dev
 permissions:
   contents: write
+concurrency:
+  group: ${{ github.workflow }}-${{ github.ref }}
+  cancel-in-progress: true
 defaults:
   run:
     working-directory: ./docs
@@ -20,7 +23,7 @@ jobs:
           fetch-depth: 10
       - uses: actions/setup-node@v4
         with:
-          node-version: "20"
+          node-version: "22"
           cache: npm
       - run: npm ci
       - name: Build docs
32 changes: 0 additions & 32 deletions .github/workflows/genai-iat.yml

This file was deleted.

19 changes: 19 additions & 0 deletions .github/workflows/genai-issue-labeller.yml
@@ -0,0 +1,19 @@
name: GenAI Issue Labeller
on:
  issues:
    types: [opened]
permissions:
  contents: read
  issues: write
  models: read
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true
jobs:
  genai-issue-labeller:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: pelikhan/action-genai-issue-labeller@v0
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
22 changes: 0 additions & 22 deletions .github/workflows/genai-pr.yml

This file was deleted.

20 changes: 20 additions & 0 deletions .github/workflows/genai-pull-request-descriptor.yml
@@ -0,0 +1,20 @@
name: GenAI Pull Request Descriptor
on:
  pull_request:
    types: [opened, reopened, ready_for_review]
permissions:
  contents: read
  pull-requests: write
  models: read
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true
jobs:
  generate-pull-request-description:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: pelikhan/action-genai-pull-request-descriptor@v0
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}

31 changes: 0 additions & 31 deletions .github/workflows/genai-test.yml

This file was deleted.
