
Commit 592b565

feat: add support for local LLMs
Adds Ollama support (Gemma 3 models) to the web-codegen-scorer.
1 parent 074eb08 commit 592b565

12 files changed: +438 −128 lines

README.md

Lines changed: 39 additions & 32 deletions
@@ -5,21 +5,21 @@ Models (LLMs).
 
 You can use this tool to make evidence-based decisions relating to AI-generated code. For example:
 
-* 🔄 Iterate on a system prompt to find the most effective instructions for your project.
-* ⚖️ Compare the quality of code produced by different models.
-* 📈 Monitor generated code quality over time as models and agents evolve.
+- 🔄 Iterate on a system prompt to find the most effective instructions for your project.
+- ⚖️ Compare the quality of code produced by different models.
+- 📈 Monitor generated code quality over time as models and agents evolve.
 
 Web Codegen Scorer is different from other code benchmarks in that it focuses specifically on _web_
 code and relies primarily on well-established measures of code quality.
 
 ## Features
 
-* ⚙️ Configure your evaluations with different models, frameworks, and tools.
-* ✍️ Specify system instructions and add MCP servers.
-* 📋 Use built-in checks for build success, runtime errors, accessibility, security, LLM rating, and
+- ⚙️ Configure your evaluations with different models, frameworks, and tools.
+- ✍️ Specify system instructions and add MCP servers.
+- 📋 Use built-in checks for build success, runtime errors, accessibility, security, LLM rating, and
   coding best practices. (More built-in checks coming soon!)
-* 🔧 Automatically attempt to repair issues detected during code generation.
-* 📊 View and compare results with an intuitive report viewer UI.
+- 🔧 Automatically attempt to repair issues detected during code generation.
+- 📊 View and compare results with an intuitive report viewer UI.
 
 ## Setup
 
@@ -40,6 +40,13 @@ export OPENAI_API_KEY="YOUR_API_KEY_HERE" # If you're using OpenAI models
 export ANTHROPIC_API_KEY="YOUR_API_KEY_HERE" # If you're using Anthropic models
 ```
 
+> [!NOTE]
+> Web Codegen Scorer also supports local models via Ollama. To use them, you must have a running Ollama server with the respective model(s) installed. By default, the tool expects the server on port `11434`; you can change this by setting the `OLLAMA_PORT` environment variable.
+>
+> Be aware that local models may sometimes cause execution errors because their output does not conform to the expected format. This is a present-day limitation of these models, so treat the feature as experimental.
+>
+> Currently supported models: `gemma3:4b`, `gemma3:12b`, `codegemma:7b`
+
 3. **Run an eval:**
 
    You can run your first eval using our Angular example with the following command:
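
To complement the local-model note above, a minimal setup might look like the following. This is an illustrative sketch, not part of the diff; it assumes the Ollama CLI is installed and on your `PATH`.

```bash
# Illustrative sketch: prepare a local model for Web Codegen Scorer.

# Start the Ollama server if it is not already running (default port: 11434).
ollama serve &

# Pull one of the currently supported models.
ollama pull gemma3:4b

# Sanity check: list the locally installed models via the Ollama HTTP API.
curl http://localhost:11434/api/tags

# If your server listens on a non-default port, tell the scorer about it.
export OLLAMA_PORT=11434
```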
@@ -63,11 +70,11 @@ You can customize the `web-codegen-scorer eval` script with the following flags:
 
 - `--env=<path>` (alias: `--environment`): (**Required**) Specifies the path from which to load the
   environment config.
-  - Example: `web-codegen-scorer eval --env=foo/bar/my-env.mjs`
+  - Example: `web-codegen-scorer eval --env=foo/bar/my-env.mjs`
 
 - `--model=<name>`: Specifies the model to use when generating code. Defaults to the value of
   `DEFAULT_MODEL_NAME`.
-  - Example: `web-codegen-scorer eval --model=gemini-2.5-flash --env=<config path>`
+  - Example: `web-codegen-scorer eval --model=gemini-2.5-flash --env=<config path>`
 
 - `--runner=<name>`: Specifies the runner to use to execute the eval. Supported runners are
   `genkit` (default) or `gemini-cli`.
@@ -77,47 +84,47 @@ You can customize the `web-codegen-scorer eval` script with the following flags:
   `.web-codegen-scorer/llm-output` directory (e.g., `.web-codegen-scorer/llm-output/todo-app.ts`).
   This is useful for re-running assessments or debugging the build/repair process without incurring
   LLM costs for the initial generation.
-  - **Note:** You typically need to run `web-codegen-scorer eval` once without `--local` to
-    generate the initial files in `.web-codegen-scorer/llm-output`.
-  - The `web-codegen-scorer eval:local` script is a shortcut for
-    `web-codegen-scorer eval --local`.
+  - **Note:** You typically need to run `web-codegen-scorer eval` once without `--local` to
+    generate the initial files in `.web-codegen-scorer/llm-output`.
+  - The `web-codegen-scorer eval:local` script is a shortcut for
+    `web-codegen-scorer eval --local`.
 
 - `--limit=<number>`: Specifies the number of application prompts to process. Defaults to `5`.
-  - Example: `web-codegen-scorer eval --limit=10 --env=<config path>`
+  - Example: `web-codegen-scorer eval --limit=10 --env=<config path>`
 
 - `--output-directory=<name>` (alias: `--output-dir`): Specifies which directory to output the
   generated code under, which is useful for debugging. By default, the code will be generated in a
   temporary directory.
-  - Example: `web-codegen-scorer eval --output-dir=test-output --env=<config path>`
+  - Example: `web-codegen-scorer eval --output-dir=test-output --env=<config path>`
 
 - `--concurrency=<number>`: Sets the maximum number of concurrent AI API requests. Defaults to `5` (
   as defined by `DEFAULT_CONCURRENCY` in `src/config.ts`).
-  - Example: `web-codegen-scorer eval --concurrency=3 --env=<config path>`
+  - Example: `web-codegen-scorer eval --concurrency=3 --env=<config path>`
 
 - `--report-name=<name>`: Sets the name for the generated report directory. Defaults to a
   timestamp (e.g., `2023-10-27T10-30-00-000Z`). The name will be sanitized (non-alphanumeric
   characters replaced with hyphens).
-  - Example: `web-codegen-scorer eval --report-name=my-custom-report --env=<config path>`
+  - Example: `web-codegen-scorer eval --report-name=my-custom-report --env=<config path>`
 
 - `--rag-endpoint=<url>`: Specifies a custom RAG (Retrieval-Augmented Generation) endpoint URL. The
   URL must contain a `PROMPT` substring, which will be replaced with the user prompt.
-  - Example:
-    `web-codegen-scorer eval --rag-endpoint="http://localhost:8080/my-rag-endpoint?query=PROMPT" --env=<config path>`
+  - Example:
+    `web-codegen-scorer eval --rag-endpoint="http://localhost:8080/my-rag-endpoint?query=PROMPT" --env=<config path>`
 
 - `--prompt-filter=<name>`: String used to filter which prompts should be run. By default, a random
   sample (controlled by `--limit`) will be taken from the prompts in the current environment.
   Setting this can be useful for debugging a specific prompt.
-  - Example: `web-codegen-scorer eval --prompt-filter=tic-tac-toe --env=<config path>`
+  - Example: `web-codegen-scorer eval --prompt-filter=tic-tac-toe --env=<config path>`
 
 - `--skip-screenshots`: Whether to skip taking screenshots of the generated app. Defaults to
   `false`.
-  - Example: `web-codegen-scorer eval --skip-screenshots --env=<config path>`
+  - Example: `web-codegen-scorer eval --skip-screenshots --env=<config path>`
 
 - `--labels=<label1> <label2>`: Metadata labels that will be attached to the run.
-  - Example: `web-codegen-scorer eval --labels my-label another-label --env=<config path>`
+  - Example: `web-codegen-scorer eval --labels my-label another-label --env=<config path>`
 
 - `--mcp`: Whether to start an MCP for the evaluation. Defaults to `false`.
-  - Example: `web-codegen-scorer eval --mcp --env=<config path>`
+  - Example: `web-codegen-scorer eval --mcp --env=<config path>`
 
 - `--help`: Prints out usage information about the script.
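
Putting several of these flags together with the new local-model support, a run might look like the sketch below. The `--model=gemma3:4b` syntax for Ollama models and the particular flag combination are assumptions; `<config path>` is a placeholder, as in the examples above.

```bash
# Hypothetical invocation combining the flags above with a local Ollama model.
web-codegen-scorer eval \
  --env=<config path> \
  --model=gemma3:4b \
  --limit=3 \
  --concurrency=2 \
  --report-name=gemma3-local-run \
  --labels local-llm experimental
```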

@@ -132,11 +139,11 @@ If you've cloned this repo and want to work on the tool, you have to install its
 running `pnpm install`.
 Once they're installed, you can run the following commands:
 
-* `pnpm run release-build` - Builds the package in the `dist` directory for publishing to npm.
-* `pnpm run eval` - Runs an eval from source.
-* `pnpm run report` - Runs the report app from source.
-* `pnpm run init` - Runs the init script from source.
-* `pnpm run format` - Formats the source code using Prettier.
+- `pnpm run release-build` - Builds the package in the `dist` directory for publishing to npm.
+- `pnpm run eval` - Runs an eval from source.
+- `pnpm run report` - Runs the report app from source.
+- `pnpm run init` - Runs the init script from source.
+- `pnpm run format` - Formats the source code using Prettier.
 
 ## FAQ
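
As a sketch of the contributor workflow described by the development scripts in the hunk above, assuming `pnpm run eval` forwards extra flags to the underlying CLI:

```bash
# Contributor workflow sketch: install dependencies, then run an eval from source.
pnpm install
pnpm run eval --env=<config path> --model=gemini-2.5-flash
```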

@@ -166,7 +173,7 @@ Yes! We plan to both expand the number of built-in checks and the variety of cod
 
 Our roadmap includes:
 
-* Including _interaction testing_ in the rating, to ensure the generated code performs any requested
+- Including _interaction testing_ in the rating, to ensure the generated code performs any requested
   behaviors.
-* Measuring Core Web Vitals.
-* Measuring the effectiveness of LLM-driven edits on an existing codebase.
+- Measuring Core Web Vitals.
+- Measuring the effectiveness of LLM-driven edits on an existing codebase.

package.json

Lines changed: 2 additions & 1 deletion
@@ -70,6 +70,7 @@
     "file-type": "^21.0.0",
     "genkit": "^1.19.1",
     "genkitx-anthropic": "0.23.1",
+    "genkitx-ollama": "^1.19.2",
     "gpt-tokenizer": "^3.0.1",
     "handlebars": "^4.7.8",
     "limiter": "^3.0.0",
@@ -78,8 +79,8 @@
     "p-queue": "^8.1.0",
     "puppeteer": "^24.10.1",
     "sass": "^1.89.2",
-    "stylelint": "^16.21.1",
     "strict-csp": "^1.1.1",
+    "stylelint": "^16.21.1",
     "stylelint-config-recommended-scss": "^16.0.0",
     "tinyglobby": "^0.2.14",
     "tsx": "^4.20.3",
